Google adds Sepedi and Xitsonga to Translate using data from a thousand languages

Google Translate has been helping people communicate no matter the language differences. Now, with advanced technology, more South African languages have been added to the service. Picture: Clay Banks/UnSplash

Published Jun 3, 2022

Google Translate is the most used translation service on the web, with over 500 million people using it daily to translate between 133 languages. Now, thanks to technical advances by Google’s AI research team, it has added support for both Sepedi and Xitsonga, which together are spoken by 21 million people across South Africa and its neighbours.

The two languages are among 24 newly added to Google Translate.

Machine learning is a subset of artificial intelligence (AI), where statistical models learn how to perform a task by looking at examples, rather than being programmed to do the task a certain way.

Advances in machine learning have been responsible for the explosion of performance and media coverage around AI over the last decade.

Google Translate switched over to machine learning techniques in 2016, providing significant improvements to translations for many languages.

However, these techniques require vast amounts of data to be effective, and for translation that has traditionally meant ‘parallel text’ – digital text that has full translations available. This creates a clear limitation for languages, such as some of our official languages, that do not have large bodies of translated works or simply do not have much digitised text.

For many languages, only relatively small amounts of monolingual text are available.

In a remarkable step forward for translation technologies, Google has developed tools which allow for effective translation using only monolingual text – a process called Zero-Shot Machine Translation.

The strength of machine learning is finding statistical relationships in mountains of training data, which can be generalised to new data.

Instead of training the model using data from one or two languages, Google has leveraged the vast dataset of text it has collected from over 1,000 languages. In this way, the model not only learns some things about the specific language by looking at monolingual text, but also learns to use the commonalities that underlie human language in general to generate translations.

To be clear, these systems do not ‘understand’ language, but are very good at giving the translations that they have learned are most likely to be correct.

Local language experts were consulted both to evaluate translations and help develop new tools for collecting and cleaning data. Ultimately, translations for 24 languages were considered meaningful and useful enough to add to the Google Translate service.

These include Sepedi and Xitsonga, which bring the number of official South African languages supported by Google Translate to seven. Translations for Sesotho, Afrikaans, isiXhosa, isiZulu (and English, of course) were already available.