The languages that defy auto-translate

There are more than 7,000 languages in the world, 4,000 of which are written. Yet only 100 or so can be translated by automated tools such as Google Translate. New research promises to let us communicate with the others too.

Imagine you come across a message that could contain life-saving information. But there's a problem: you don't understand a word. You're not even sure which of the world's thousands of languages it is written in. What do you do?

If the message is in French or Spanish, typing it into an automatic translation engine will instantly solve the mystery and produce a solid answer in English. But many other languages still defy machine translation, including languages spoken by millions of people, such as Wolof, Luganda, Twi and Ewe in Africa. That's because the algorithms that power these engines learn from human translations – ideally, millions of words of translated text.

There is an abundance of such material for languages like English, French, Spanish and German, thanks to multilingual institutions like the Canadian parliament, the United Nations and the European Union. Their human translators churn out streams of translated transcripts and other documents. The European Parliament alone has produced a data trove of 1.37 billion words in 23 languages over a decade.

No such data mountain exists, however, for languages that may be widely spoken but not as prolifically translated. They are known as low-resource languages. The fallback machine-training material for these languages consists of religious publications, including the much-translated Bible. But this amounts to a narrow dataset, and is not enough to train accurate, wide-ranging translation robots.

Google Translate currently offers the ability to communicate in around 108 different languages, while Microsoft's Bing Translator offers around 70. Yet there are more than 7,000 spoken languages around the world, and at least 4,000 with a writing system.

That language barrier can pose a problem for anyone who needs to gather precise, global information in a hurry – including intelligence agencies.

The United Nations produces volumes of translated text every year that can be used to train algorithms (Credit: Mohammed Elshamy/Getty Images)

"I would say the more interested an individual is in understanding the world, the more one must be able to access data that are not in English," says Carl Rubino, a programme manager at IARPA, the research arm of US intelligence services. "Many challenges we face today, such as economic and political instability, the Covid-19 pandemic, and climate change, transcend our planet – and, thus, are multilingual in nature."

Training a human translator or intelligence analyst in a new language can take years. Even then, it may not be enough for the task at hand. "In Nigeria, for instance, there are over 500 languages spoken," Rubino says. "Even our most world-renowned experts in that country may understand just a small fraction of those, if any."

To break that barrier, IARPA is funding research to develop a system that can find, translate and summarise information from any low-resource language, whether it is in text or speech.

Picture a search engine where the user types a query in English, and receives a list of summarised documents in English, translated from the foreign language. When they click on one, the full translated document comes up. While the funding comes from IARPA, the research is carried out openly by competing teams, and much of it has been published.
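
In rough outline, such a system chains several components together. The Python sketch below is purely illustrative: every function is a trivial stand-in, and none of the names come from the actual IARPA programme.

```python
# Purely illustrative sketch of the pipeline described above. Every
# component is a trivial stand-in for a real subsystem.
from dataclasses import dataclass

@dataclass
class Document:
    text: str            # foreign-language text, or an audio transcript
    is_speech: bool = False

def transcribe(doc: Document) -> str:
    return doc.text                      # real system: speech recognition

def translate(text: str) -> str:
    return f"[EN] {text}"                # real system: low-resource MT model

def summarise(text: str) -> str:
    return text[:80]                     # real system: keyword/summary model

def is_relevant(query: str, text: str) -> bool:
    return any(word.lower() in text.lower() for word in query.split())

def search(query_en: str, docs: list[Document]) -> list[str]:
    """Take an English query, return English summaries of matching documents."""
    hits = []
    for doc in docs:
        text = transcribe(doc) if doc.is_speech else doc.text
        translated = translate(text)     # translate first, then match the query
        if is_relevant(query_en, translated):
            hits.append(summarise(translated))
    return hits

print(search("protest", [Document("[protest report in another language]")]))
```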

Kathleen McKeown, a computer scientist at Columbia University who leads one of the competing teams, sees benefits beyond the intelligence community. "The ultimate goal is to facilitate more interaction between, and more information about, people from different cultures," she says.

The research teams are tackling the problem with neural networks, a form of artificial intelligence that mimics some aspects of human thinking. Neural network models have revolutionised language processing in recent years. Instead of just memorising words and sentences, they can learn their meaning. They can work out from the context that words like "dog", "poodle", and the French "chien" all express similar concepts, even if they look very different on the surface.
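
To make the idea concrete, here is a toy Python sketch. The three-dimensional vectors are made up for illustration; real models learn vectors with hundreds of dimensions from millions of sentences. Words used in similar contexts end up pointing in similar directions.

```python
# Toy illustration of word embeddings: each word maps to a vector, and
# words with similar meanings end up close together. These numbers are
# invented; real models learn them from data.
import numpy as np

embeddings = {
    "dog":        np.array([0.90, 0.80, 0.10]),
    "poodle":     np.array([0.85, 0.75, 0.20]),
    "chien":      np.array([0.88, 0.78, 0.15]),  # French for "dog"
    "parliament": np.array([0.05, 0.10, 0.95]),
}

def cosine_similarity(a, b):
    """How closely two word vectors point in the same direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["dog"], embeddings["chien"]))       # high
print(cosine_similarity(embeddings["dog"], embeddings["parliament"]))  # low
```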

To do this, however, the models usually need to go through millions of pages of training text. The challenge is to get them to learn from smaller amounts of data – just like humans do. After all, humans don't need to read years' worth of parliamentary records to learn a language.

"Whenever you study a language, you would never, ever in your lifetime see the amount of data today's machine translation systems use for learning English-to-French translation," says Regina Barzilay, a computer scientist at MIT who is a member of another of the competing teams. "You see a tiny, tiny fraction, which enables you to generalise and to understand French. So in the same way, you want to look at the next generation of machine-translation systems that can do a great job even without having this kind of data-hungry behaviour."

To tackle the problem, each team is divided into smaller specialist groups that solve one aspect of the system. The main components are automatic search, speech recognition, translation and text summarisation technologies, all adapted to low-resource languages. Since the four-year project began in 2017, the teams have worked on eight different languages, including Swahili, Tagalog, Somali and Kazakh.

Machine-powered translation tools can provide vital ways of communicating in situations where a human translator may not be available (Credit: Maciej Luczniewski/Getty Images)

One breakthrough has been to harvest text and speech from the web, in the form of news articles, blogs and videos. Thanks to users all over the world posting content in their mother tongues, there is a growing mass of online data for many low-resource languages.

"If you search the internet, and you want data in Somali, you get hundreds of millions of words, no problem," says Scott Miller, a computer scientist at the University of Southern California who co-leads one of the research teams working on this. "You can get text in almost any language in fairly large quantities on the web."

This online data tends to be monolingual, meaning that the Somali articles or videos are just in that language, and don't come with a parallel English translation. But Miller says neural network models can be pre-trained on such monolingual data in many different languages.

It is thought that during their pre-training, the neural models learn certain structures and features of human language in general, which they can then apply to a translation task. What these are is a bit of a mystery. "No one really knows what structures these models really learn," says Miller. "They have millions of parameters."

But once pre-trained on many languages, the neural models can learn to translate between individual languages using very little bilingual training material, known as parallel data. A few hundred thousand words of parallel data are enough – about the length of a few novels.
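
As a hedged sketch of that two-phase recipe, the Python below uses the Hugging Face transformers library to fine-tune mT5, one publicly available model pre-trained on monolingual text in many languages. The placeholder sentence pairs stand in for a real parallel corpus, and real training would use batching and many passes over the data.

```python
# Hedged sketch: start from a model pre-trained on monolingual text in
# many languages, then fine-tune it on a small parallel corpus. The two
# placeholder pairs stand in for a few hundred thousand words of data.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

parallel_data = [
    ("<sentence one in the low-resource language>", "<its English translation>"),
    ("<sentence two in the low-resource language>", "<its English translation>"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for source, target in parallel_data:
    batch = tokenizer(source, text_target=target, return_tensors="pt")
    loss = model(**batch).loss   # how far the output is from the reference
    loss.backward()              # nudge the model toward the reference
    optimizer.step()
    optimizer.zero_grad()
```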

The multilingual search engine will be able to comb through human speech as well as text, which presents another set of complex problems. For example, speech recognition and transcription technology typically struggles with sounds, names and places it has not come across before. 

"My example would be a country that's maybe relatively obscure to the West, and perhaps a politician gets assassinated," says Peter Bell, a specialist in speech technology at the University of Edinburgh who is part of one of the teams trying to tackle this problem. "His name is now really important, but previously, it was obscure, it didn't feature. So how do you go and find that politician's name in your audio?"

One solution used by Bell and his collaborators is to go back to words that were initially transcribed with a measure of uncertainty, indicating that the machine was not familiar with them. On re-inspection, one of them may turn out to be the previously obscure name of the politician.
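
A simplified illustration of that re-inspection step, assuming the recogniser attaches a confidence score to every word it transcribes (a common feature of speech recognition systems); the transcript, scores and names below are invented. A real system would also use fuzzy or phonetic matching rather than exact string comparison.

```python
# Invented example: words the recogniser was unsure about are flagged so
# they can be re-checked against names that have suddenly become newsworthy.
transcript = [
    ("the", 0.99), ("politician", 0.97), ("abdikarim", 0.41),  # uncertain
    ("was", 0.98), ("assassinated", 0.95),
]

CONFIDENCE_THRESHOLD = 0.6

def flag_uncertain(words, threshold=CONFIDENCE_THRESHOLD):
    """Return the words the recogniser was least sure about."""
    return [word for word, conf in words if conf < threshold]

# Hypothetical list of newly important names to look for.
breaking_names = {"abdikarim"}

for word in flag_uncertain(transcript):
    if word in breaking_names:
        print(f"Low-confidence word '{word}' matches a name of interest")
```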

Once it has found and translated the relevant information, the search engine sums it up for the user. It's during this summarising process that neural models display some of their strangest behaviour – they hallucinate.

Breaking down language barriers could bring benefits that go far beyond the intelligence agencies (Credit: Getty Images)

Imagine you are searching for a news report about protesters who stormed a building on a Monday. But the summary that comes up says they stormed it on a Thursday. This is because the neural model drew on its background knowledge, based on millions of pages of training text, when it summarised the report. In those texts, there were more examples of people storming buildings on Thursdays, so it concluded this should apply to the latest example too.

Similarly, neural models may insert dates or numbers that never appeared in the source document. Computer scientists call this hallucinating.

"These neural network models, they're so powerful, they have memorised a lot of languages, they add words that were not in the source," says Mirella Lapata, a computer scientist at the University of Edinburgh who is developing a summarisation element for one of the teams.

Lapata and her colleagues have avoided the problem by extracting keywords from each document, rather than telling the machine to sum it up in sentences. Keywords are less elegant than sentences, but they limit the models' tendencies to write robot poetry.
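
Here is a minimal sketch of that extractive approach, assuming TF-IDF as the keyword scorer; the article does not say which scoring method Lapata's team uses. Because every keyword is copied verbatim from the document, there is nothing for the model to invent.

```python
# Minimal extractive keyword sketch using TF-IDF (an assumed scorer, not
# necessarily the team's actual method). Keywords come straight from the
# text, so the summary cannot hallucinate words that were never there.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Protesters stormed the parliament building on Monday.",
    "The parliament passed a new budget after a long debate.",
    "Floods closed several roads in the capital on Monday.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(documents)
vocab = vectorizer.get_feature_names_out()

# For each document, keep the three highest-scoring words as its "summary".
for doc_index, text in enumerate(documents):
    scores = tfidf[doc_index].toarray().ravel()
    top = scores.argsort()[::-1][:3]
    print(text, "->", [vocab[i] for i in top])
```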

While the search engine is designed for living languages, the project includes a sub-group working on languages that have not been spoken in thousands of years. Such ancient languages are extremely low-resource, since many survive only as text fragments. They provide a useful testing ground for techniques that could then be applied to modern low-resource languages.

Barzilay’s PhD student at MIT, Jiaming Luo, and their collaborators developed an algorithm that can work out if certain ancient languages have modern survivors. They gave it a head-start by feeding it basic information about these languages, and about general aspects of language change. With this knowledge, the model was able to make some discoveries on its own, using only a small amount of data. It correctly worked out that Ugaritic, an ancient language from the Near East, is related to Hebrew. It also concluded that Iberian, an ancient European language, is closer to Basque than to other European languages – though not close enough to be a near relative. 

Barzilay hopes that such approaches could inspire broader change and make neural models less data-hungry. "Our dependence on huge parallel data – it's a weakness of the system," she says. "So if you are really producing good technology, be it for decipherment, be it for small languages, it's going to push the field forward."

The teams have all managed to produce basic versions of the multilingual search engine, refining it with each new language. Rubino, the IARPA programme manager, believes such technologies could change how intelligence is gathered. "We will indeed have the opportunity to revolutionise the way our analysts learn from foreign-language data, allowing monolingual, English-speaking analysts access to multilingual data they previously were not able to work with," he says.

Machine learning could help to decipher extinct languages such as Ugaritic, which was used in northern Syria from the 14th to the 12th Century BC (Credit: API/Gamma-Rapho/Getty Images)

While intelligence analysts are trying to prise open low-resource languages from the outside, native speakers of those languages are also taking matters into their own hands. They, too, want access to urgent information in other languages – not for espionage, but to improve their everyday lives.

"When this Covid-19 pandemic happened, there was a sudden need to translate basic health tips into many languages. And we couldn't do this with machine translation models, because of the quality," says David Ifeoluwa Adelani, a doctoral student in computer science at Saarland University in Saarbrücken, Germany. "I think this has really taught us that it's important that we have technology that works for low-resource languages, especially in time of need."

Adelani, who is originally from Nigeria and a native Yorùbá speaker, has been building a Yorùbá-English database as part of a non-profit project called Cracking the Language Barrier for a Multilingual Africa. He and his team created a new dataset by gathering translated movie scripts, news, literature and public talks. They then used this dataset to fine-tune a model already trained on religious texts, such as Jehovah's Witnesses publications, improving its performance. Similar efforts are underway for other African languages like Ewe, Fongbe, Twi and Luganda, helped by grassroots communities such as Masakhane, a network of researchers from all over Africa.

One day, all of us may be using multilingual search engines in our everyday lives, unlocking the world's knowledge at the click of a button. Until then, the best way to really understand a low-resource language is probably to learn it – and join the multilingual, online human chatter that trains the world's translation robots.
