The prosperity of indigenous languages is a priority at the North-West University

The South African Centre for Digital Language Resources is a national research infrastructure that is hosted at the NWU and is funded by the Department of Science and Innovation.

Published Feb 9, 2023


Author: Bertie Jacobs

It is no secret that, at the North-West University (NWU), all 11 of the official South African languages are seen as indispensable custodians of knowledge.

They are the mediums through which we learn and through which we then communicate the information that we have acquired. They allow us to grow as a society. If we are to use them as tools to achieve our goals, they need to be preserved and promoted.

They need to be made functional in a myriad of forms – especially in a digital age in which we are moving at breakneck speed. It is a task of paramount importance, and one in which the South African Centre for Digital Language Resources (SADiLaR) at the NWU is at the national forefront of digital language innovation.

SADiLaR is a national research infrastructure (RI) that is hosted at the NWU and is funded by the Department of Science and Innovation (DSI). Its strategic mandate is to create, manage and distribute digital resources as well as applicable software in all the official South African languages in order to stimulate and support research and development in the humanities and social sciences.

“SADiLaR is the first and currently the only research centre of its kind in Africa. As a national research infrastructure it has a hub that is linked to various nodes located at different universities and at South Africa's Council for Scientific and Industrial Research (CSIR) that assist in the digital development, research and support of all the official languages of South Africa,” explains Professor Langa Khumalo, Director of SADiLaR.

“The programme at SADiLaR is broadly divided into two imperatives. The first is to drive the digitisation programme, which involves the creation of relevant text, speech and multimodal resources for the research and development of the 11 official languages of South Africa. The second is a programme to ignite, cultivate and grow the scholarship of digital humanities, with a special focus on human capacity development,” says Khumalo.

He is also adamant that the digitisation of indigenous languages is a crucial endeavour, especially considering the unique questions posed by the digital age.

“The context of the Fourth Industrial Revolution impels us to develop digital resources for indigenous languages so that these languages can maintain their social relevance in future. The future will become increasingly more digitised and if we want to continue with the use of these languages, their digitisation and the development of human language technologies become imperative. We need to bring African languages into the cyber and digital infrastructure for posterity. Just recently, Google utilised data aggregated by SADiLaR to improve speech recognition quality for under-resourced African languages. This is very important in demonstrating that investing in language resource creation is needed not just in South Africa but internationally, and if done correctly, it can enable role players in the sector to create user-facing technologies that add value and increase access to information.”

However, it is a task that has some difficulties to overcome.

Digitisation is an important precursor to the development of an array of human language technologies. For these technologies to be developed and supported for indigenous languages, resources in these languages need to be digitised. The development of various corpora in these languages – text corpora, speech corpora and multimodal resources – is vital.

Various challenges arise in the digitisation of datasets of indigenous languages. These range from the quality of the datasets to their accessibility. The biggest obstacle, though, is copyright. SADiLaR has engaged aggressively with the stakeholders that hold these datasets in order to clear the copyright issues before it can access and later share the datasets.

SADiLaR has a specialisation node that deals with digitisation, and it is doing an excellent job of digitising datasets for indigenous languages.

One of the core undertakings of the NWU is to make every voice heard, and through SADiLaR and the university’s inclusive language policy – to name but a few – more roads are being paved for every South African to discover.