The AfWN as an example of DH research and development

Prof. Sonja Bosch delivering her segment of a presentation at the 7th CSIR Conference

Prof. Sonja Bosch, node co-manager and founding collaborator in the African Wordnet (AfWN) Project, recently participated in the 7th CSIR Conference on 11 and 12 November 2020. In a collaboration between the South African Centre for Digital Language Resources (SADiLaR), UNISA and the CSIR entitled Hallo Digital Humanities: A reshaping of the Humanities in the fourth industrial revolution, Prof. Bosch presented a segment highlighting the modern approach to research and development taken in the AfWN Project. Her presentation is entitled The African Wordnet – an example of DH-related research and development, and the full conference presentation is available to watch online.

[Original text by Marissa Griesel, 2021-03-10]

Developing digital African language resources

Working with government on the South African Centre for Digital Language Resources (SADiLaR), Unisa’s College of Human Sciences demonstrates how serious it is about developing African languages and contributing to the development of South African and African indigenous knowledge systems.

SADiLaR, a national centre supported by the Department of Science and Innovation (DSI), forms part of the South African Research Infrastructure Roadmap (SARIR). According to the SADiLaR website, “SARIR is a high-level strategic and systemic intervention to provide research infrastructure across the entire public research system, building on existing capabilities and strengths, and drawing on future needs.”

SADiLaR has an enabling function, with a focus on all official languages of South Africa, supporting research and development in the domains of language technologies and language-related studies in the humanities and social sciences. The centre supports the creation, management and distribution of digital language resources, as well as applicable software, which are freely available for research purposes through the Language Resource Catalogue.

SADiLaR runs two programmes. The first is a digitisation programme, which entails the systematic creation of relevant digital text, speech and multi-modal resources related to all official languages of South Africa. The development of appropriate natural language processing software tools for research and development purposes is included as part of the digitisation programme.

The second is a digital humanities programme, which facilitates the building of research capacity by promoting and supporting the use of digital data and innovative methodological approaches within the humanities and social sciences.


Unisa’s African languages in a unique position

The centre’s website states that its clients are academic scholars and professionals in all domains of humanities and social sciences, language technologies, natural language processing, computer science, as well as potential end-users in education, business and industry. SADiLaR is also a multi-partner entity with the North-West University functioning as host, as well as hub of a network of linked nodes, one of which is the Unisa Department of African Languages. The Unisa node has two node managers, Prof Sonja Bosch and Prof Mampaka Lydia Mojapelo.

According to Bosch, the Unisa Department of African Languages is working in close collaboration on the creation, management and distribution of digital language resources, which are made freely available.

“The Unisa node of SADiLaR, linked to the Department of African Languages, specialises in language development since this department is in the unique position of offering all nine official African languages. The Unisa node stands on two legs; the first leg is the African Wordnet (AfWN) and the second one is the Multilingual Linguistic Terminology.”

She explains that SADiLaR has contributed considerably to the sustainability of these projects at Unisa. “Instead of time and effort being spent on writing short-term funding proposals, longer-term arrangements with SADiLaR ensure a more stable research and development environment.”

Mojapelo states that, in this way, the project teams can concentrate on the development and quality assurance of the wordnets and the linguistic terminology. “Linguists involved in this project also benefit in the sense that they can now be supplied with all essential equipment such as laptops, have access to the relevant software, and attend dedicated training sessions. Furthermore, researchers have the opportunity to publish their research findings.”

She says that the language resources that are being developed are managed with great ease via the SADiLaR server, while the hosting of the wordnet editing tool on this server means stability and continuous technical support.

A training workshop introducing WordnetLoom as an editing tool during 2019 was facilitated by international experts in the field of wordnet development, and was well attended by project members:
Front row: Justina Wieczorek (PolNet developer and facilitator of the workshop, University of Wroclaw, Poland), Dr Janek Wieczorek (WordnetLoom development team member and facilitator of the workshop, University of Wroclaw, Poland), Mmasibidi Setaka (Sesotho), Angelinah Dazela (isiXhosa) and Valencia Wagner (Setswana)
Middle row: Prof Sonja Bosch (Node manager), Matseleng Mabusela (Sesotho sa Leboa), Mercy Mahwasane (Tshivenḓa), Dr Inie Kock (Sesotho), Taki Matamela (Tshivenḓa), Opelo Thole (Setswana), Celimpilo Dladla (isiZulu) and Lindelwa Mahonga (isiZulu)
Back row: Delvah Mathevula (Xitsonga), Respect Mlambo (Xitsonga), Prof Mampaka Lydia Mojapelo (Node manager, Sesotho sa Leboa), Mlamli Diko (isiXhosa), Dr Jurie le Roux (Setswana), Dr Celani Zwane (isiZulu) and Prof Stanley Madonsela (Siswati)

Making multilingual linguistic terminology freely available

Mojapelo explains that the Unisa project team is also well aware that African languages have important and specialised terminology in specific fields. In this regard, it is their aim to make the multilingual linguistic terminology freely available in a large database so that these resources will contribute positively to the teaching and learning domain as well as to other forms of language practice such as language learning and interpretation.

“Open access to the African Wordnet data, as well as the Multilingual Linguistic Terminology, is bound to have a significant impact, not only on the promotion of African languages, but also on the further development of natural language processing applications such as inter-lingual information retrieval, question-answering systems as well as machine translation,” explains Bosch.

Speaking more on the African Wordnet project, Bosch says that South Africa, with its rich diversity of 11 official languages, is seen as a potential emerging market where language technology (LT) applications can contribute to the promotion of multilingualism and language development, and as such have a positive impact on the South African community. In this regard, a wordnet is one of the fundamental resources required for the development of a large number of core language technologies (LTs) and LT applications. A wordnet is a lexical database consisting of words that are grouped into sets of synonyms called synsets. Various conceptual-semantic and lexical relations are indicated between the synsets contained in a wordnet.
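The structure described above can be illustrated with a minimal Python sketch. It is not the AfWN data model or the WordnetLoom format: the class names, the synset identifiers ("ex-1" and so on) and the English example entries are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """A set of synonymous words plus its links to other synsets (hypothetical model)."""
    synset_id: str
    lemmas: List[str]
    definition: str = ""
    # relation name -> ids of the target synsets
    relations: Dict[str, List[str]] = field(default_factory=dict)


class Wordnet:
    """A toy lexical database: synsets indexed by id."""

    def __init__(self) -> None:
        self.synsets: Dict[str, Synset] = {}

    def add(self, synset: Synset) -> None:
        self.synsets[synset.synset_id] = synset

    def relate(self, source_id: str, relation: str, target_id: str) -> None:
        """Record a conceptual-semantic relation between two synsets."""
        self.synsets[source_id].relations.setdefault(relation, []).append(target_id)

    def lookup(self, word: str) -> List[Synset]:
        """Return every synset that contains the given word."""
        return [s for s in self.synsets.values() if word in s.lemmas]

    def hypernym_chain(self, synset_id: str) -> List[str]:
        """Follow the first 'hypernym' link upwards until none is left."""
        chain: List[str] = []
        seen = {synset_id}
        current = synset_id
        while True:
            parents = self.synsets[current].relations.get("hypernym", [])
            if not parents or parents[0] in seen:
                break
            current = parents[0]
            seen.add(current)
            chain.append(current)
        return chain


# Illustrative entries only; the ids and definitions are invented for this sketch.
wn = Wordnet()
wn.add(Synset("ex-1", ["dog", "domestic dog"], "a domesticated canine"))
wn.add(Synset("ex-2", ["canine", "canid"], "a member of the dog family"))
wn.add(Synset("ex-3", ["animal", "creature"], "a living organism that can move"))
wn.relate("ex-1", "hypernym", "ex-2")
wn.relate("ex-2", "hypernym", "ex-3")

print([s.synset_id for s in wn.lookup("canid")])
print(wn.hypernym_chain("ex-1"))
```

Grouping synonyms into one synset means that a relation such as "hypernym" is stated once per concept rather than once per word, which is what makes the resource reusable across applications.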

She explains that wordnets for African languages were introduced with a training workshop for linguists, lexicographers and computer scientists by international experts in 2007. Since then, wordnets for five African languages, namely Setswana (tsn), isiXhosa (xho), isiZulu (zul), Sesotho sa Leboa (nso) and Tshivenḓa (ven), have grown to roughly 10 000 synsets each, while the other four official African languages, namely Sesotho (sot), Xitsonga (tso), isiNdebele (nde) and Siswati (ssw), each boast 1 000 synsets. Marissa Griesel, a PhD student in the Department of African Languages, is the general project manager.

Bosch concludes by highlighting that wordnets are not only useful, but indispensable components of large automatic language understanding systems being developed and tested in academia and industry. “Adding several South African languages to the wordnet web enables many such applications for each of these languages in isolation. Moreover, linking the South African wordnets to one another and to the many global wordnets makes cross-linguistic information retrieval and question answering possible, and significantly aids machine translation, which is an important contribution to the empowerment of the African languages.”
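The linking idea in the quotation can be sketched as follows, assuming that wordnets in different languages share a common concept identifier. The identifier "c-dog", the dictionary layout and the two single-word entries are illustrative assumptions, not AfWN data; only the language codes follow those used in this article.

```python
from typing import Dict, List

# language code -> shared concept id -> lemmas in that language
# (illustrative entries only, not taken from the African Wordnet)
LinkedWordnets = Dict[str, Dict[str, List[str]]]

wordnets: LinkedWordnets = {
    "eng": {"c-dog": ["dog"]},
    "zul": {"c-dog": ["inja"]},
}


def translate(word: str, source: str, target: str, data: LinkedWordnets) -> List[str]:
    """Find target-language lemmas that share a concept id with the source word."""
    results: List[str] = []
    for concept_id, lemmas in data[source].items():
        if word in lemmas:
            results.extend(data[target].get(concept_id, []))
    return results


print(translate("inja", "zul", "eng", wordnets))
print(translate("dog", "eng", "zul", wordnets))
```

Because every language points at the same concept identifier, adding one more wordnet makes it reachable from all the others without any pairwise dictionaries.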

* Compiled by Rivonia Naidu-Hoffmeester, Communications and Marketing Specialist, College of Human Sciences and first published as a news article on https://www.unisa.ac.za/sites/corporate/default/Colleges/Human-Sciences/News-&-events/Articles/Developing-digital-African-language-resources

 Publish date: 2020/09/04

Global Wordnet Conference 2021 to be hosted in Pretoria, South Africa

Call for papers

11th International Global Wordnet Conference,
Pretoria, South Africa/Venue to be announced
18-21 January 2021

Global Wordnet Association: www.globalwordnet.org

Conference website: https://www.globalwordnet.co.za/

The Global Wordnet Association is pleased to announce that the 11th International Global Wordnet Conference (GWC2021) will be held from 18 to 21 January 2021 in Pretoria, South Africa. The conference will be hosted by the South African Centre for Digital Language Resources (SADiLaR). Because COVID-19 makes the circumstances at the time of the conference uncertain, the venue and format of the conference will be announced in due course.

COVID-19 Pandemic
Please note that given the current global pandemic and the uncertainty surrounding international travel in January 2021, the organisers are also preparing a virtual conference as a contingency. A final announcement as to the format of GWC2021 will be made in October 2020. Should the conference be hosted as a virtual event, a combination of the approaches successfully implemented by RAIL2020 and LREC2020 will be used as a roadmap: authors will be asked to prerecord their presentations, which will be streamed during a live virtual event with an opportunity for questions and discussion after each presentation; proceedings will also be published and a reduced registration fee will apply.

Conference Topics
Contributions are invited on all aspects of wordnets, including but not limited to:

  1. Lexical semantics and meaning representation
  • Critical analysis and applications of lexical and semantic relations
  • Proposed new relations
  • Definitions, semantic components, co-occurrence and frequency statistics
  • Word and Sense Embeddings
  • Necessity and completeness issues
  • Ontology and wordnet
  • Other lexicographical and lexicological questions pertaining to wordnet-style meaning representation
  • Wordnets and Linked Open Data (LOD)
  2. Architecture of lexical databases
  • Language independent and language dependent components
  • Integration of multi-wordnets in research infrastructures (like CLARIN) and LT networks (like META-NET)
  3. Tools and Methods for wordnet development
  • User and Data entry interfaces
  • Methods for constructing, extending and enriching wordnets
  4. Applications of wordnet
  • Word sense disambiguation
  • Machine translation
  • Information extraction and retrieval
  • Document structuring and categorisation
  • Automatic hyperlinking
  • Language pedagogy
  • Psycholinguistic applications
  5. Standardization, distribution and availability of wordnets and wordnet tools

Submission Guidelines
Submissions will fall into one of the following categories (page limits exclude references):

  • long papers: 8 pages max., 30 minutes presentation
  • short papers: 5 pages max., 15 minutes presentation
  • project reports: 5 pages max., 10 minutes presentation
  • demonstrations: 5 pages max., with an additional 3 pages of screen dumps or images; 20 minutes presentation
  • work in progress: 5 pages max., poster presentation

Submissions should be anonymous and any identifying information must be removed. Authors must state the preferred category, though acceptance may be subject to change in the category of the presentation, e.g. a long paper submission may be accepted as a short paper.

Final papers should be submitted in electronic form (PDF only).

Papers must be submitted to the EasyChair website: https://easychair.org/conferences/?conf=gwc2021

Papers must follow the ACL format (PDF); a template can be found at http://acl2010.org/authors_final.html

Important Dates

  • 18 September 2020: Deadline for paper submission
  • 23 October 2020: Notification of acceptance
  • 23 October 2020: Registration opens
  • 20 November 2020: Deadline for author registration, final paper due
  • 18-21 January 2021: Conference

Proceedings
Conference proceedings will be open access and downloadable from the GWA website. The proceedings will have an ISBN and be published in the ACL anthology.

Papers are only included in the proceedings if at least one author has registered by 20 November 2020.

Inclusion of accepted submissions into the final program and the proceedings is contingent upon at least one author’s registration. Late registration and on-site registration for participants is possible without inclusion of the paper and without presentation.

Post-conference Workshop
A student workshop will be held at the conference venue on 22 January 2021. More details to follow during October 2020.

Student Bursaries and Travel Grants
Applications for student bursaries and travel grants will open in October 2020 and more details will be made available on the conference website.

Conference Chairs
Christiane Fellbaum, fellbaum@princeton.edu
Piek Vossen, piek.vossen@vu.nl

Local Organising Committee
Sonja Bosch – University of South Africa (UNISA)
Mampaka Lydia Mojapelo – University of South Africa (UNISA)
Marissa Griesel – University of South Africa (UNISA) (contact person: griesm@unisa.ac.za)
Elsabé Taljard – University of Pretoria (UP)
Juan Steyn – SADiLaR
Liané van den Berg – SADiLaR

New development phase for the AfWN


Linguists from the Sotho language group (Sesotho, Setswana and Sesotho sa Leboa) assessing examples from the OERTB.

The South African Centre for Digital Language Resources (SADiLaR) recently awarded funding for a new development phase in the African Wordnet project. This phase will add 6 000 new synsets, including usage examples and definitions, as well as 5 000 new usage examples for existing synsets, to the AWN. Two more African languages will also be included in the project during the two-year period from 2017 to 2019. The project will continue to be hosted by the Department of African Languages at the University of South Africa (UNISA).

A hands-on workshop at UNISA on 9 February 2018 aimed to equip linguists with important skills in writing effective definitions. It was facilitated by Dr Mariëtta Alberts, author of Terminology and Terminography: Principles and Practice. A South African Perspective [1]. Thirty-one linguists from UNISA, Tshwane University of Technology, University of Pretoria and the University of Venda attended the workshop and represented all nine South African languages involved in the project. Dr Alberts led the linguists through the theory behind constructing definitions before they tried their hand, in smaller groups, at assessing some example definitions from the Open Educational Resource Term Bank [2]. During the feedback session, linguists working in different language groups could share specific pitfalls and suggest improvements to the examples.

We would like to thank Dr Alberts for sharing her expertise and experience with our project team, as well as Prof. Phaahla, CoD of the Department of African Languages at UNISA for sponsoring the venue for this event.

[1] Alberts, M. 2017. Terminology and Terminography: Principles and Practice. A South African Perspective. Milnerton: MLA Publications. 483 pages. ISBN 978-0-9947129-0-5

[2] See http://oertb.tlterm.com/ for more information on the resource.

This article first appeared in the SADiLaR internal newsletter, February 2018

Written by Marissa Griesel

Department of African Languages inspiring creation of digital text resources

Some of the delegates who attended the African Languages seminar, which looked at phase three of the African Wordnet project.

The African Wordnet project, funded by the Unisa Women-in-Research Fund, recently completed a third phase of development. Progress on this innovative project was shared at a seminar on 21 June 2017, entitled Creation of digital text resources for research and development of African languages.

Three international guests also shared their research on the collection of digital text resources with close on 50 delegates. The seminar attracted a multidisciplinary audience, with colleagues from various departments in Unisa, as well as from the Meraka Institute (CSIR), University of Pretoria and Tshwane University of Technology, participating in the discussion. Postgraduate students in both humanities and computer science were also invited. The languages in the African Wordnet project are considered resource-scarce, compared to most other languages listed in the Global WordNet Association, in the sense that lexical resources are very limited. During the seminar, Professor Sonja Bosch, project leader for the African Wordnet project, began with an overview emphasising the prime importance of the sustainability and longevity of the African Wordnets as reusable, sharable, and expandable text resources.

She then introduced the South African digital resources landscape by sharing the exciting news of the newly approved South African Centre for Digital Language Resources (SADiLaR). This is a new research infrastructure set up by the Department of Science and Technology (DST) forming part of the new South African Research Infrastructure Roadmap (SARIR). Unisa’s Department of African Languages, as the only department in the country with expertise in all nine official African languages, has been identified as one of five nodes of SADiLaR, and will focus on language development.

Following this introduction, Marissa Griesel, project manager for the African Wordnets, gave more detailed information on the current developments in the African Wordnet project. The various platforms on which the AWN data will be disseminated were discussed. These include, most notably, a searchable database on the Open Multilingual Wordnet project, a full download from the South African Language Resource Management Agency and a dictionary application for smartphones in the Kamusi project.

One very important aspect of all of these platforms is that the resource created in the African Wordnet project will be freely available for use by researchers and developers in African languages. The different platforms also make the AWN accessible to researchers from various disciplines, either as a browsable and user-friendly component or as a machine-readable XML database from the RMA.

Two project members, Professor Mampaka Lydia Mojapelo (Sesotho sa Leboa) and Dr Jurie le Roux (Setswana), then shared examples of linguistic phenomena that deserved special attention while compiling the African Wordnet for their language. Both linguists showed interesting examples where the African language differs from English and demonstrated the innovative ways in which they overcame these differences to ensure a true representation of their language, while staying true to the internationally accepted structure of a wordnet.

Guest speakers Professor Uwe Quasthoff, Dr Dirk Goldhahn, and Dr Thomas Eckart, affiliated to the Natural Language Processing Group, University of Leipzig, Germany, also shared their experiences concerning the construction of lexical resources. The scarceness of digital text resources for African languages can be addressed by a corpus collection initiative for under-resourced languages. This initiative entails the semi-supervised crawling for Web texts in the South African languages, processing of corpora and free availability of all resulting data. The availability of large text resources in turn makes it possible to extract valuable information about word usage and typical word patterns. This information can be used as input for identifying candidates of paradigmatic relations in a semi-automatic approach to synset generation for lexical databases. For their unique approach to be successful, however, language experts are encouraged to suggest websites in the different African languages via the easy-to-use CURL portal (http://curl.corpora.uni-leipzig.de/).

Professor Puleng Segalo, Head of Research and Graduate Studies in the College of Human Sciences, concluded the seminar by restating the importance of initiatives such as the African Wordnet and CURL projects to further the African languages and their speakers. A lively discussion over some light refreshments saw colleagues and students discuss various collaboration opportunities and interesting research topics.

The Department of African Languages wishes to thank everyone for attending the seminar, for adding to the discussion and for their interest in the projects, and also thanks SADiLaR for its sponsorship of the seminar. A special word of thanks goes to our guest speakers, Professor Quasthoff, Dr Goldhahn and Dr Eckart, for introducing new and exciting ways to get involved in resource creation for lesser-resourced languages, and to Professor Segalo for being a champion for new research and the continued development of resources such as these within an academic sphere.

This article first appeared online on UNISA eNews, 20 July 2017

Written by Marissa Griesel