Department of African Languages inspiring creation of digital text resources

Some of the delegates who attended the African Languages seminar, which looked at phase three of the African Wordnet project.

The African Wordnet project, funded by the Unisa Women-in-Research Fund, recently completed a third phase of development. Progress on this innovative project was shared at a seminar on 21 June 2017, entitled Creation of digital text resources for research and development of African languages.

Three international guests also shared their research on the collection of digital text resources with close on 50 delegates. The seminar attracted a multidisciplinary audience with colleagues from various departments in Unisa, as well as from the Meraka Institute (CSIR), University of Pretoria and Tshwane University of Technology participating in the discussion. Postgraduate students in both humanities and computer science were also invited. The languages in the African Wordnet project are considered resource-scarce, compared to most other languages listed in the Global WordNet Association, in the sense that lexical resources are very limited. During the seminar, Professor Sonja Bosch, project leader for the African Wordnet project, began with an overview emphasising the prime importance of the sustainability and longevity of the African Wordnets as reusable, sharable, and expandable text resources.

She then introduced the South African digital resources landscape by sharing the exciting news of the newly approved South African Centre for Digital Language Resources (SADiLaR). This is a new research infrastructure set up by the Department of Science and Technology (DST) forming part of the new South African Research Infrastructure Roadmap (SARIR). Unisa’s Department of African Languages, as the only department in the country with expertise in all nine official African languages, has been identified as one of five nodes of SADiLaR, and will focus on language development.

Following this introduction, Marissa Griesel, project manager for the African Wordnets, gave more detailed information on current developments in the African Wordnet project. The various platforms on which the AWN data will be disseminated were discussed. These include, most notably, a searchable database on the Open Multilingual Wordnet project, a full download from the South African Language Resource Management Agency, and a dictionary application for smartphones in the Kamusi project.

One very important aspect of all of these platforms is that the resource created in the African Wordnet project will be freely available for use by researchers and developers in African languages. The different platforms also make the AWN accessible to researchers from various disciplines, either as a browsable and user-friendly component or as a machine-readable XML database from the RMA.

Two project members, Professor Mampaka Lydia Mojapelo (Sesotho sa Leboa) and Dr Jurie le Roux (Setswana), then shared examples of linguistic phenomena that deserved special attention while compiling the African Wordnet for their language. Both linguists showed interesting examples where the African language differs from English and demonstrated the innovative ways in which they overcame these differences to ensure a true representation of their language, while staying true to the internationally accepted structure of a wordnet.

Guest speakers Professor Uwe Quasthoff, Dr Dirk Goldhahn, and Dr Thomas Eckart, affiliated to the Natural Language Processing Group, University of Leipzig, Germany, also shared their experiences concerning the construction of lexical resources. The scarceness of digital text resources for African languages can be addressed by a corpus collection initiative for under-resourced languages. This initiative entails the semi-supervised crawling for Web texts in the South African languages, processing of corpora and free availability of all resulting data. The availability of large text resources in turn makes it possible to extract valuable information about word usage and typical word patterns. This information can be used as input for identifying candidates for paradigmatic relations in a semi-automatic approach to synset generation for lexical databases. For their unique approach to be successful, however, language experts are encouraged to suggest websites in the different African languages via the easy-to-use CURL portal.

Professor Puleng Segalo, Head of Research and Graduate Studies in the College of Human Sciences, concluded the seminar by restating the importance of initiatives such as the African Wordnet and CURL projects in furthering the African languages and their speakers. A lively discussion over some light refreshments saw colleagues and students discuss various collaboration opportunities and interesting research topics.

The Department of African Languages wishes to thank everyone for attending the seminar, for adding to the discussion and for their interest in the projects, and also thanks SADiLaR for its sponsorship of the seminar. A special word of thanks goes to our guest speakers, Professor Quasthoff, Dr Goldhahn and Dr Eckart, for introducing new and exciting ways to get involved in resource creation for lesser-resourced languages, and to Professor Segalo for being a champion for new research and the continued development of resources such as these within an academic sphere.

This article first appeared online on UNISA eNews, 20 July 2017

Written by Marissa Griesel
