Microsoft’s Project ELLORA
To bring ‘rare’ Indian languages online, Microsoft launched Project ELLORA, or Enabling Low Resource Languages, in 2015. Under the project, researchers are building digital resources for these languages. They say their purpose is to preserve a language for posterity so that users of these languages “can participate and interact in the digital world.”
How is ELLORA creating a language dataset?
The researchers are mapping out resources, including printed literature, to create a dataset to train their AI model. The team is also working with these communities on the project, Microsoft said.
“By involving the community in the data collection process, they [researchers] hope to create a dataset that is both accurate and culturally relevant,” the company noted.
Microsoft working with Mundas
Microsoft is currently working with the Munda community of about a million people spread across the eastern Indian states of Jharkhand, Odisha and West Bengal.
The community speaks Mundari. However, according to the Microsoft researchers, its members are concerned about the longevity of their language, as only prominent languages like Bengali, Hindi and Odia are taught to children in schools.
A handful of researchers at the Microsoft Research (MSR) lab in India have been working toward creating digital ecosystems for languages like Mundari that have a written script but don’t have enough presence in the digital world.
Internet’s language is English
English has been the internet’s language since its earliest years. Things have improved since, but only eight of the nearly 6,000 languages around the world are preferred online. This means that 88% of the world’s languages do not have enough of a presence on the internet. It also means that 1.2 billion people, or 20% of the world’s population, can’t use their language to navigate the digital world.
Hindi-to-Mundari: Work in progress
Microsoft says that its research team is currently working on a Hindi-to-Mundari text translation model as well as a speech recognition model, which will give the community access to more content in Mundari.
Microsoft said that its researchers collaborated with IIT Kharagpur in 2018 “and sponsored a study to find what the community needs to keep the language alive.”
They are also building a text-to-speech model for the language, which doesn’t have significant digital content to train machine learning models. IIT Kharagpur professors initially worked with members of the community to help them manually translate sentences from Hindi to Mundari. The speech collection is done on a smartphone using the Karya app.
The researchers also developed a new technology called Interactive Neural Machine Translation (INMT), which predicts the next word as someone translates between languages and so speeds up the translation process.
Apart from the Mundari language, Microsoft is also working with Gondi speakers and the Idu Mishmi community in Arunachal Pradesh.
Meta’s language translation AI tool
Facebook parent Meta is also working on something similar. Last year, the company announced that it had developed an AI translation tool that can convert an unwritten (or oral) language to spoken English. An unwritten language is one that is primarily spoken and does not have a widely used writing system.
The company said that its AI was able to translate Hokkien, an oral language, to English. Hokkien is one of 3,500 languages that are spoken and do not have a written system (or at least not one used widely enough to train an AI model).