How to Overcome the Need for Data for Low-Resource Languages



Neural machine translation (NMT) has revolutionized the language services industry, enabling providers to deliver large volumes of translated content quickly, affordably, and accurately. But NMT depends on large amounts of training data, which is a problem for low-resource languages (LRLs) that lack linguistic resources such as extensive parallel corpora, annotated data, and pre-trained language models.

Only about 20 of the world’s 7,000 languages are resource-rich today. Languages such as English, French, and Chinese dominate the NMT space and can be translated with a high degree of accuracy. That means people who speak languages of lesser diffusion often have limited access to translations in their native tongue.

To solve this problem, the machine translation community has developed several approaches, including transfer learning, multilingual learning, and data augmentation.  

Three Major Challenges for Low-Resource Languages  

Before we dive into NMT techniques, let’s explore a few challenges of gathering training data for low-resource languages. These include: 

  • Lack of annotated datasets. Supervised machine learning (ML) requires annotations. Supervised models are commonly used for specific tasks, such as detecting hate speech. However, creating annotated datasets requires humans to label every training example individually, which makes the process costly and time-consuming.

Consequently, manual data creation is hard to sustain. For instance, a dataset of 10,000 examples might take months to compile due to the manual annotation process, causing a delay in developing new ML models.

  • Lack of unlabeled datasets. Unlabeled datasets, such as text corpora, are precursors to annotated versions, and they’re essential for training base models that can later be fine-tuned for specific tasks. Where such corpora exist, approaches that make the most of them are crucial. For example, distant supervision and self-supervised learning can derive labels or training signals from unlabeled data automatically, at a fraction of the cost of manual annotation.
  • Supporting multiple dialects of a language. Languages with numerous dialects pose unique challenges, especially for speech models. In most cases, a model trained on one dialect won’t do well on another. For example, most unlabeled and annotated Arabic datasets are in Modern Standard Arabic, but many Arabic speakers find it too formal when interacting with voice or chat assistants. That is why practical use cases require dialect support.

Keep in mind that this list of challenges grows with every low-resource language. As a result, even large corporations that develop NLP software only support a small number of LRLs.  
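
To make the idea of distant supervision mentioned above more concrete, here is a minimal Python sketch. It assumes a small seed lexicon of cue words and uses it to label raw sentences automatically. The lexicon, the label names, and the function names are all hypothetical and only for illustration; the Swahili cue words would need review by native speakers before real use.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical seed lexicon (an assumption for this sketch). In practice it
# might come from a bilingual dictionary or a community-built word list.
SEED_LEXICON = {
    "positive": {"nzuri", "furaha"},
    "negative": {"mbaya", "huzuni"},
}


def distant_label(sentence: str) -> Optional[str]:
    """Return a label if exactly one class of cue words appears, else None."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    hits = {label for label, cues in SEED_LEXICON.items() if tokens & cues}
    # Skip sentences with no cue words or with conflicting cues.
    return next(iter(hits)) if len(hits) == 1 else None


def label_corpus(sentences: List[str]) -> List[Tuple[str, str]]:
    """Keep only the sentences that could be labeled automatically."""
    labeled = []
    for sentence in sentences:
        label = distant_label(sentence)
        if label is not None:
            labeled.append((sentence, label))
    return labeled


if __name__ == "__main__":
    raw = ["Chakula ni nzuri.", "Hali ni mbaya", "Leo ni Jumanne"]
    for sentence, label in label_corpus(raw):
        print(label, "|", sentence)
```

Labels produced this way are noisy, so they usually serve as a cheap starting point rather than a replacement for human review.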

Three NMT Techniques for Low-Resource Languages

Fortunately, there are a few ways researchers can train machine learning models on LRLs despite limited data. Below is a brief overview of three of the most common.   

Cross-Lingual Transfer Learning 

Transfer learning refers to using pre-trained models from one language pair to improve model performance on another. Simply put, it allows knowledge gained from one language pair to be leveraged for another. This can be particularly useful in low-resource scenarios where training data is limited. It includes:

  • Transfer of annotations such as part-of-speech (POS) tags or syntactic and semantic features via cross-lingual bridges (e.g., word or phrase alignments). This requires understanding the relationship between the source and target languages based on linguistic knowledge and resources. For example, in aligned French and English sentences, the French word “chat” aligns with the English word “cat.” Therefore, the noun tag that a French POS tagger assigns to “chat” can be projected onto “cat.”
  • Transfer of models in a high-resource language to a low-resource language through zero-shot or one-shot learning. Zero-shot learning assumes a model trained in one domain will generalize effectively to a low-resource environment without any examples from it. Similarly, one-shot learning uses a very small number of examples from a low-resource domain to adapt a model trained with rich resources. For example, it’s possible to create a zero-shot model that classifies Welsh text by leveraging transfer learning from a model trained on English.
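
The annotation transfer described above can be sketched in a few lines of Python. This is a simplified illustration, assuming the word alignment is already available as a list of index pairs (in practice it would come from an alignment tool). The function name and the "X" placeholder tag for unaligned words are assumptions made for this example.

```python
from typing import List, Tuple


def project_pos_tags(
    source_tags: List[str],
    alignment: List[Tuple[int, int]],
    target_length: int,
    default: str = "X",
) -> List[str]:
    """Copy each source tag onto the target token it is aligned with.

    alignment holds (source_index, target_index) pairs. Target tokens with
    no aligned source token keep the default tag.
    """
    target_tags = [default] * target_length
    for src_idx, tgt_idx in alignment:
        target_tags[tgt_idx] = source_tags[src_idx]
    return target_tags


if __name__ == "__main__":
    french = ["le", "chat", "dort"]
    french_tags = ["DET", "NOUN", "VERB"]
    english = ["the", "cat", "sleeps"]
    # Toy one-to-one alignment, written by hand for this example.
    alignment = [(0, 0), (1, 1), (2, 2)]
    tags = project_pos_tags(french_tags, alignment, len(english))
    print(list(zip(english, tags)))
```

Real alignments are rarely this clean: words can be unaligned or aligned many-to-one, which is why projected annotations are typically treated as noisy training data.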

Multilingual Learning 

Multilingual learning involves training one model on multiple languages, on the assumption that similar words and sentences from different languages will have similar representations. As a result, it can also be helpful for cross-lingual transfer learning, since knowledge from high-resource languages like English can be transferred to low-resource languages like Swahili. This allows base models to perform better on low-resource languages despite a lack of text corpora.
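
As a rough illustration of how one model can be trained on several languages at once, the sketch below merges parallel corpora for different language pairs into a single training set. It assumes one widely used convention, which the article itself does not prescribe: prepending a target-language tag such as "<2sw>" to each source sentence so the model knows which language to produce. The tag format and function names are assumptions for this example.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def tag_example(source: str, target: str, target_lang: str) -> Pair:
    """Prefix the source sentence with a tag naming the target language."""
    return (f"<2{target_lang}> {source}", target)


def build_multilingual_corpus(corpora: Dict[Tuple[str, str], List[Pair]]) -> List[Pair]:
    """Merge per-language-pair corpora into one tagged training set.

    corpora maps (source_lang, target_lang) to a list of sentence pairs.
    """
    mixed = []
    for (_src_lang, tgt_lang), pairs in corpora.items():
        for source, target in pairs:
            mixed.append(tag_example(source, target, tgt_lang))
    return mixed


if __name__ == "__main__":
    corpora = {
        ("en", "fr"): [("Thank you", "Merci")],
        ("en", "sw"): [("Thank you", "Asante")],
    }
    for source, target in build_multilingual_corpus(corpora):
        print(source, "->", target)
```

Because every pair passes through the same model, the high-resource pairs can shape representations that the low-resource pairs also benefit from.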

Data Augmentation 

Data augmentation is a set of techniques that can create additional data by modifying existing data or adding data from different sources. Instead of altering NMT models, it generates data to train them. Techniques include: 

  • Word alignment. Also known as phrase-replacement augmentation, word alignment uses a subset of sentences from an existing parallel or monolingual corpus to generate new sentences by replacing words or phrases. For example, given the sentence “She is playing the violin,” a word alignment system may generate another sentence like “He is playing the guitar.”
  • Back-translation. This technique creates synthetic parallel data from monolingual text. A reverse (target-to-source) machine translation model translates monolingual target-language sentences into the source language, and each machine-generated source sentence is paired with its original target sentence as additional training data. For instance, for an English-to-Swahili system, monolingual Swahili sentences can be translated into English, and the resulting English-Swahili pairs added to the training set.
  • Parallel data mining. This is the process of collecting parallel corpora for neural machine translation models. Parallel corpora are collections of texts in two or more languages that align at the sentence or phrase level. This allows direct comparison and translation between the languages. The Tatoeba Project is an excellent example. It’s an open-source collection of sentences and translations in 285 languages, which can be used to build parallel corpora for machine translation.
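
Two of the augmentation techniques above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not a production pipeline: the substitution table is written by hand, and the "reverse translator" is a hypothetical word-for-word dictionary lookup standing in for a real target-to-source NMT model. Back-translation is shown here in its usual data-augmentation form, where monolingual target sentences are paired with machine-generated source sentences.

```python
from typing import Callable, Dict, List, Tuple


def replace_words(sentence: str, substitutions: Dict[str, List[str]]) -> List[str]:
    """Generate new sentences by swapping one word at a time."""
    tokens = sentence.split()
    variants = []
    for i, token in enumerate(tokens):
        for alternative in substitutions.get(token, []):
            variants.append(" ".join(tokens[:i] + [alternative] + tokens[i + 1:]))
    return variants


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each real target sentence with a synthetic source sentence."""
    return [(reverse_translate(t), t) for t in target_sentences]


# Hypothetical stand-in for a trained Swahili-to-English model.
TOY_SW_TO_EN = {"asante": "thanks", "rafiki": "friend"}


def toy_reverse_translate(sentence: str) -> str:
    """Word-for-word lookup; unknown words are copied unchanged."""
    return " ".join(TOY_SW_TO_EN.get(w, w) for w in sentence.lower().split())


if __name__ == "__main__":
    print(replace_words("She is playing the violin", {"violin": ["guitar"]}))
    # Synthetic (English, Swahili) pairs for training an English-to-Swahili model.
    print(back_translate(["asante rafiki"], toy_reverse_translate))
```

The synthetic source side may contain translation errors, but the target side is genuine text, which is what the forward model learns to produce.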

How to Increase Equity and Inclusivity in NMT 

Each NMT technique has its advantages. However, they aren’t always enough to build genuinely diverse language models. That requires a blend of local community involvement, public access to tools and training models, and distribution of computational resources. 

Massive Crawl and Data Set Creation

Most NMT datasets originate in a few regions and focus on English-centric translation. Yet about 40% of online content is now in other languages. To capture this information, projects such as ParaCrawl automatically mine large amounts of parallel data for multiple languages, including LRLs such as Irish and Nepali.

However, crawling tools cannot reliably detect racial and gender bias, hate speech, or illegal content. Researchers should involve local communities, which can make better judgments about bias when creating datasets.

Open-Source Tools and Frameworks

Access to tools and frameworks for low-resource languages is critical to the field’s advancement. Making them free and open-source benefits the whole LRL community.

The Low Resource Languages for Emergent Incidents (LORELEI) project provides a suite of open-source tools to assist with the automatic processing of low-resource languages. Other projects like CLARIN facilitate sharing data and tools across various languages, focusing on low-resource languages.

Public Access to Trained NMT Models

Trained NMT models are expensive and time-consuming to build. Once made public, they can serve as parent models from which child models for LRLs are fine-tuned. Publicly releasing NMT models, including massive multilingual NMT models, would benefit those working on LRL-NMT and advance other areas of NMT.

Since Google released its publicly available multilingual NMT model, researchers have quickly developed better models for target languages, given the availability of parent models.

Availability of Computational Resources 

Researchers in the LRL community focus on developing computationally efficient NMT models and providing computational resources. However, research organizations, industry organizations, and governments need to do more to distribute resources. They can make computational resources available at conferences and challenges to encourage LRL researchers. For example, the AI4All Open Challenge was created to offer computational resources to LRL researchers by providing access to GPUs and cloud computing resources.

The Future of NMT for Low-Resource Languages

As you can see, several promising approaches to tackling low-resource languages exist, from transfer learning and data augmentation to community involvement and open-access tools. Despite this, there is still no universal solution that covers all languages. We must keep exploring and developing effective methods to bridge the digital gap between low-resource languages and the global language landscape. It’s a challenge that requires effort from us all, and with the right tools and resources, we can ensure every language is included.