Written by Zhang Xiaojun, Lai Wen
Abstract:
Although neural machine translation models perform well on language pairs with large-scale, high-quality parallel corpora, experiments show that their performance drops significantly for low-resource language pairs, even falling below that of traditional statistical machine translation models. This article analyzes the main challenges currently faced by neural machine translation for low-resource languages, above all the scarcity of training data, and addresses them through data augmentation and model fusion. Based on the type of corpora used during training, low-resource neural machine translation methods are categorized into supervised, semi-supervised, and unsupervised methods. We make a preliminary attempt at a supervised method using data augmentation techniques, study semi-supervised neural machine translation based on pivot languages, and improve the recently proposed unsupervised neural machine translation model based on word embeddings. Experimental results show that the supervised and semi-supervised machine translation methods yield significant improvements.
Keywords:
Low-resource languages; Neural machine translation; Data augmentation; Model fusion
Despite significant advances in natural language processing technology, there are more than 7,000 languages in the world, yet research in natural language processing is concentrated on only a small number of languages (around 20) that have large-scale text corpora, such as English and other widely used languages. Most other languages urgently need corresponding language processing tools and corpus resources to meet the demands of current deep learning models; these languages are referred to as low-resource languages. Since 2014, end-to-end, attention-based neural machine translation models have matured, achieving substantial improvements in translation quality over traditional statistical machine translation models. High-quality, large-scale parallel corpora between two or more languages are a prerequisite for today's neural machine translation models to perform well. However, apart from a few resource-rich languages such as English, German, and Chinese, most languages in the world have no large-scale bilingual parallel corpora that meet the needs of neural machine translation models. How to solve machine translation for low-resource languages has therefore become a hot research area in machine translation.
1 Data Augmentation and Model Fusion
Data scarcity is the main challenge faced by low-resource language machine translation, and there are two approaches to address this issue: one is to fully utilize the existing bilingual parallel training corpus, which is referred to as “data augmentation”; the other is to integrate the language model from monolingual training data with the translation model, known as “model fusion”.
1.1 Enhancing Existing Bilingual Parallel Corpora
Back translation is one of the most common data augmentation methods in machine translation. Its main idea is to use a target-to-source translation model (the back-translation model) to generate pseudo-bilingual sentence pairs for training a source-to-target translation model (the forward translation model). Back translation requires only a single back-translation model and lets machine-generated data enlarge the training set, which has led to its widespread use. For low-resource machine translation, pseudo-training data can also be constructed simply by copying target-language sentences to the source side, which likewise improves translation performance: even though the constructed pseudo data is inaccurate, the target side still consists of real sentences, so the mutual translation information between the two languages is preserved while some noise is introduced. Training a neural machine translation model on pseudo-bilingual sentence pairs teaches it to handle noisy inputs and thereby improves its robustness. Because the back-translation model is trained only on limited bilingual data, the quality of the generated pseudo source-language data is hard to guarantee; an iterative back-translation procedure can therefore be used to keep improving both the forward and backward translation models. If only source-language monolingual data is available, the corresponding target-language translations can instead be produced by a forward translation model trained on the bilingual data, again yielding pseudo-training data for the forward model. However, since the quality of these generated translations is also difficult to guarantee, such pseudo data contributes little to the fluency of the output and mainly strengthens the feature-extraction ability of the encoder.
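As a concrete illustration, the sketch below shows the data flow of back translation: a backward model produces pseudo source sentences for real target-language monolingual data, and the resulting pairs are concatenated with the real bilingual corpus. The `translate_target_to_source` function is a hypothetical placeholder, not the article's system; any trained target-to-source model could be plugged in.

```python
# Minimal back-translation sketch (the backward model is a placeholder).

def translate_target_to_source(sentence: str) -> str:
    """Placeholder for a trained target-to-source (back-translation) model."""
    return "<pseudo source for: " + sentence + ">"

# Real monolingual data on the target-language side.
target_monolingual = [
    "Das ist ein Beispiel.",
    "Maschinelle Uebersetzung ist schwierig.",
]

# Pseudo-parallel pairs: generated source side, real target side.
pseudo_parallel = [(translate_target_to_source(t), t) for t in target_monolingual]

# The pseudo pairs are simply concatenated with the real bilingual data
# before training the forward (source -> target) model.
real_parallel = [("see on naide", "Das ist ein Beispiel.")]
training_data = real_parallel + pseudo_parallel
print(training_data)
```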
Word replacement replaces some of the words in the bilingual corpus with other words from the vocabulary. Provided the semantic or grammatical correctness of the sentences is preserved, the modified sentence pairs can be added to the training data to increase its diversity. Replacements can be made on the source side or the target side; they can target common words or rare words; they can be "deliberate" or random; a single word can be replaced, dropped, or masked; and the replacement can come from elsewhere in the vocabulary or from the same sentence. In essence, word replacement modifies the original bilingual training data to obtain noisy pseudo-bilingual training data. In neural machine translation, a common way to perform data augmentation by adding noise is to inject a small amount of noise into the original bilingual corpus while keeping the overall meaning of each sentence unchanged, thereby generating pseudo-bilingual data that expands the scale of the original training set.
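One simple, generic form of such noise injection is sketched below: each source-side word is randomly dropped, masked, or locally reordered with small probabilities, while the target sentence is left untouched. The probabilities, window size, and `<mask>` token are illustrative assumptions rather than values from the article.

```python
import random

def add_word_noise(tokens, p_drop=0.1, p_mask=0.1, shuffle_window=3, seed=None):
    """Return a noisy copy of a token list: random drop, mask, and local shuffle."""
    rng = random.Random(seed)
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                      # drop the word
        elif r < p_drop + p_mask:
            noisy.append("<mask>")        # mask the word
        else:
            noisy.append(tok)
    # Local shuffle: each token moves at most `shuffle_window - 1` positions.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(noisy))]
    return [tok for _, tok in sorted(zip(keys, noisy))]

src = "the quick brown fox jumps over the lazy dog".split()
print(add_word_noise(src, seed=0))
```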
Compared to the word replacement method, paraphrasing does not just involve slight modifications to the sentences but considers the diversity of natural language expression. By rewriting the original sentences using different structures to convey the same meaning, the paraphrasing method allows the training data to cover more linguistic phenomena and is widely applied in neural machine translation tasks. Moreover, since each sentence can correspond to multiple different translations, the paraphrasing method can help avoid overfitting of the model, thereby enhancing its generalization ability.
1.2 Integrating Monolingual Language Models
Typically, in machine translation systems, monolingual data is used to train language models, which can be utilized on both the source and target language sides. On the source language side, the language model can be used for sentence encoding and generating sentence representations; on the target language side, the language model aids the system in selecting smoother translations. In low-resource machine translation, integrating language models can help mitigate the shortcomings of scarce bilingual training data to some extent.
Since the decoder of the neural machine translation model is essentially a language model that describes the patterns of generating translated word sequences, integrating the decoder with the target language model becomes the most straightforward way to use monolingual data. Common fusion methods are divided into shallow fusion and deep fusion. The former independently trains the translation model and the language model, and when generating each word, the predicted probabilities of the two models are weighted and summed to obtain the final predicted probability; the latter jointly trains the translation model and the language model, dynamically calculating the weight of the language model during the decoding process to compute prediction probabilities.
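A minimal numeric sketch of shallow fusion follows: at a single decoding step, the translation model's and the language model's next-word scores are combined with a fixed interpolation weight in log space. The toy distributions and the weight of 0.2 are illustrative assumptions; in practice the scores come from the two independently trained models.

```python
import numpy as np

def shallow_fusion(log_p_tm, log_p_lm, beta=0.2):
    """Combine translation-model and language-model scores for one decoding step."""
    return log_p_tm + beta * log_p_lm   # weighted log-linear combination

vocab = ["cat", "dog", "sat"]
log_p_tm = np.log(np.array([0.5, 0.3, 0.2]))   # translation model P(y_t | x, y_<t)
log_p_lm = np.log(np.array([0.2, 0.1, 0.7]))   # language model    P(y_t | y_<t)

fused = shallow_fusion(log_p_tm, log_p_lm)
print(vocab[int(np.argmax(fused))])            # word chosen at this decoding step
```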
Similarly, the encoder of a neural machine translation model can be viewed as a language model for the source language, although it does not directly output generation probabilities for source sentences. Larger-scale monolingual data can nevertheless be used to train the encoder. A pre-trained encoder can then be used directly together with the decoder of the machine translation model to complete the translation task, a method known as pre-training. Pre-training separates the task of learning sentence representations from the translation task, so that additional large-scale monolingual data can be used to learn initial values for part of the parameters of the neural machine translation model; fine-tuning on bilingual data is then performed to obtain the final translation model. When representation learning yields a single vector for each word, the result is called a fixed word embedding; however, the same word often conveys different meanings in different contexts.
The model needs to understand the specific meaning of each word in its current context, which calls for contextual word embeddings. Compared with fixed word embeddings, contextual word embeddings carry semantic information from the surrounding context, enriching the model's input representation and reducing training difficulty. However, in order to extract a representation of the entire sentence, the model still has a large number of parameters to learn, and many pre-trained models have been proposed for this purpose. The most widely used at present are Generative Pre-training (GPT) and Bidirectional Encoder Representations from Transformers (BERT). Pre-trained models bring little improvement to translation between resource-rich languages, but they perform remarkably well in low-resource machine translation: the corpus used in the pre-training phase is very large, so pre-training helps most when the downstream task has little data, which is exactly the situation in low-resource machine translation. Machine translation, however, is a typical language generation task; it involves not only learning representations of the source language but also mapping one sequence to another and generating the target-language sequence. This knowledge cannot be learned by pre-training on source-language monolingual data alone; it is therefore necessary to use monolingual data to pre-train the entire encoder-decoder structure.
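As an illustration of contextual representations from a pre-trained encoder, the sketch below extracts per-token vectors from a multilingual BERT checkpoint with the Hugging Face transformers library; such representations can be used to feed or initialize the encoder of a translation model. The checkpoint name and usage follow standard transformers conventions and are not details from the article.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Multilingual BERT used as a source-side contextual encoder.
name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

sentence = "Low-resource languages lack large parallel corpora."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# One contextual vector per subword token; the same surface word receives
# different vectors in different sentences, unlike a fixed word embedding.
contextual_embeddings = outputs.last_hidden_state   # shape: (1, seq_len, hidden)
print(contextual_embeddings.shape)
```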
2 Neural Machine Translation Methods for Low-Resource Languages
Despite the good performance of neural machine translation models with large-scale high-quality parallel corpora, experiments have proven that the performance of neural machine translation significantly declines for low-resource language pairs, even falling below that of traditional statistical machine translation models. Researchers have fully utilized available corpus resources to explore various neural machine translation methods in low-resource scenarios. This article categorizes the current neural machine translation methods for low-resource languages into supervised, semi-supervised, and unsupervised methods based on the type of corpora used during the training process.
2.1 Supervised Methods in Neural Machine Translation for Low-Resource Languages
Supervised methods in neural machine translation for low-resource languages refer to methods that require direct provision of bilingual parallel corpora between the source language and target language throughout the model training process. Supervised methods can further be divided into four types: back translation, word replacement, transfer learning, and meta-learning. The first two methods focus on increasing the training data, as mentioned previously, and will not be elaborated upon further. Notably, the back translation method has been recognized as an indispensable step to enhance machine translation performance in both domestic and international machine translation evaluation competitions. Transfer learning refers to the use of knowledge gained from known tasks to improve performance on related tasks, typically reducing the amount of training data required. For neural machine translation models, the main idea of transfer learning is to first train a neural machine translation model (parent model) on resource-rich language pairs; then, use the neural network parameters of the parent model to initialize and constrain the training of the model for low-resource language pairs (child model).
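The core of this transfer-learning recipe can be sketched in PyTorch terms as below: the child (low-resource) model is initialized from the parent model's parameters and then fine-tuned on the small bilingual corpus. The `build_nmt_model` constructor is a hypothetical stand-in for whatever encoder-decoder architecture is actually used, and a shared vocabulary between parent and child is assumed.

```python
import torch
import torch.nn as nn

def build_nmt_model():
    """Hypothetical constructor; stands in for a real encoder-decoder model."""
    return nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))

# 1) Parent model trained on a resource-rich language pair (training omitted).
parent = build_nmt_model()
torch.save(parent.state_dict(), "parent.pt")

# 2) Child model for the low-resource pair starts from the parent's parameters.
child = build_nmt_model()
child.load_state_dict(torch.load("parent.pt"))

# 3) Fine-tune the child on the small bilingual corpus (loop omitted);
#    a small learning rate keeps it close to the parent initialization.
optimizer = torch.optim.Adam(child.parameters(), lr=1e-4)
```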
Meta-learning algorithms are designed to "adapt quickly to new data." Gu et al. were the first to apply meta-learning to neural machine translation. The basic idea is to first train a high-performance neural machine translation model on parallel corpora of several resource-rich language pairs; then construct a shared vocabulary covering all languages; and finally initialize the neural machine translation model for the low-resource language pair from this vocabulary and the learned model parameters. In this sense, meta-learning can be viewed as a deeper application of transfer learning to neural machine translation, and experimental results show that meta-learned neural machine translation performs well on low-resource language pairs.
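A heavily simplified, first-order meta-learning loop in the spirit of this idea is sketched below: a shared initialization is adapted to each simulated high-resource language pair in an inner loop, and the gradients of the adapted copies on held-out batches are accumulated back into that initialization in the outer loop. The toy linear model and random batches exist only to make the control flow runnable; this is not the actual setup of Gu et al.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                          # stand-in for an NMT model
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
tasks = ["de-en", "fr-en", "es-en"]              # simulated high-resource pairs

def sample_batch(task):
    """Placeholder: return a (source, target) batch for the given language pair."""
    x = torch.randn(16, 8)
    return x, x

for step in range(100):
    meta_opt.zero_grad()
    for task in tasks:
        # Inner loop: adapt a copy of the shared initialization to this pair.
        fast = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(fast.parameters(), lr=1e-2)
        x, y = sample_batch(task)
        loss_fn(fast(x), y).backward()
        inner_opt.step()
        # Outer loop (first-order approximation): gradients of the adapted copy
        # on a held-out batch are accumulated into the shared initialization.
        fast.zero_grad()
        xq, yq = sample_batch(task)
        loss_fn(fast(xq), yq).backward()
        for p, fp in zip(model.parameters(), fast.parameters()):
            p.grad = fp.grad.clone() if p.grad is None else p.grad + fp.grad
    meta_opt.step()
```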
Although supervised methods have achieved good performance in many translation tasks, they also have limitations. Firstly, the ultimate performance of this method largely depends on the quality of machine translation trained on existing parallel corpora, making it unsuitable for zero-resource languages (translation tasks without parallel corpora); secondly, this method faces some language adaptation issues in low-resource language machine translation tasks, where the characteristics of the languages themselves can impact the final model performance.
We initially attempted the supervised approach using data augmentation techniques. Unlike traditional back translation, we proposed a data augmentation method based on replacing semantically related words, which enriches the semantic information of sentences in both languages while keeping the structure of the source and target sentences intact, thereby expanding the corpus. Specifically, we first train a word vector model for the target language to select semantically related words; then, using a language model, we find semantically similar replacement words among the selected candidates; finally, we generate augmented bilingual parallel corpora based on a word alignment model and correct grammatical errors in the generated pseudo-corpus. Experimental results show that the proposed method based on semantically related word replacement can match or even exceed the baselines of current data augmentation techniques.
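The word-selection part of this idea can be illustrated, under simplifying assumptions, as follows: a word2vec model (gensim) trained on target-language monolingual text proposes semantically related candidates, and a scoring function, here a trivial frequency count standing in for a real language model, picks the replacement. Word alignment and grammar correction are not shown, and the tiny corpus is only for demonstration.

```python
from collections import Counter
from gensim.models import Word2Vec

# Tiny tokenized target-language "monolingual corpus" (illustrative only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat chased a dog".split(),
]

# 1) Train word vectors to propose semantically related candidate words.
w2v = Word2Vec(corpus, vector_size=32, window=3, min_count=1, epochs=200, seed=1)

# 2) Trivial stand-in for a language model: prefer frequent candidates.
freq = Counter(tok for sent in corpus for tok in sent)

def pick_replacement(word, topn=5):
    candidates = [w for w, _ in w2v.wv.most_similar(word, topn=topn)]
    return max(candidates, key=lambda w: freq[w])

print(pick_replacement("cat"))   # a semantically related, frequent substitute
```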
2.2 Semi-Supervised Methods in Neural Machine Translation for Low-Resource Languages
Semi-supervised methods in neural machine translation for low-resource languages refer to methods that do not directly use bilingual sentence pairs during the entire model training process but instead utilize bilingual corpora indirectly. Semi-supervised methods can be further divided into pivot methods and bilingual corpus mining methods.
The pivot method refers to using a third language as a pivot language (usually English) to build machine translation models between the pivot language and the source and target languages, thereby constructing a machine translation model between the source and target languages. The main steps of this method are to first train the neural machine translation models for the source language to pivot language (denoted as S-P) and pivot language to target language (denoted as P-T); then, use the S-P and P-T machine translation models to translate the source language into the target language, forming parallel corpora between the source and target languages; finally, train the machine translation model based on the constructed bilingual corpus between the source and target languages. The pivot-based neural machine translation method has achieved good performance in zero-resource language pairs (where there are no direct parallel corpora between the two languages) and is the primary method for zero-resource machine translation.
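The data flow of the pivot method can be sketched as below. The two translation callables are hypothetical placeholders for trained S-P and P-T models (e.g., through English as the pivot); plugging in real models yields the pseudo source-target corpus on which the direct model is then trained.

```python
def translate_source_to_pivot(sentence: str) -> str:
    """Placeholder for a trained S-P model (e.g., source language -> English)."""
    return "<pivot: " + sentence + ">"

def translate_pivot_to_target(sentence: str) -> str:
    """Placeholder for a trained P-T model (e.g., English -> target language)."""
    return "<target: " + sentence + ">"

source_sentences = ["source sentence one", "source sentence two"]

# Chain the two models through the pivot language to build a pseudo
# source-target parallel corpus, then train the direct S-T model on it.
pseudo_corpus = []
for src in source_sentences:
    pivot = translate_source_to_pivot(src)
    tgt = translate_pivot_to_target(pivot)
    pseudo_corpus.append((src, tgt))

print(pseudo_corpus)
```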
Bilingual corpus mining refers to mining bilingual parallel corpora for machine translation from large-scale internet data. The main steps of this method are to first extract large-scale corpora (non-aligned, non-parallel data) from the internet; then use existing methods (knowledge bases, multilingual sentence embeddings, etc.) to mine bilingual parallel sentence pairs; and finally use the mined parallel corpora to build neural machine translation models. Bilingual corpus mining is a major means of constructing the parallel corpora required for machine translation and is a powerful tool for building machine translation systems for low-resource languages (especially those at risk of extinction).
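A simplified mining step using multilingual sentence embeddings is sketched below with the sentence-transformers library: candidate sentences from both languages are embedded into a shared space, and pairs whose cosine similarity exceeds a threshold are kept as pseudo-parallel data. The model name and the plain similarity threshold (rather than the margin-based scoring often used in practice) are illustrative assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

src_sentences = ["La casa es grande.", "Me gusta leer libros."]
tgt_sentences = ["I like reading books.", "The weather is nice today."]

src_emb = model.encode(src_sentences, convert_to_numpy=True, normalize_embeddings=True)
tgt_emb = model.encode(tgt_sentences, convert_to_numpy=True, normalize_embeddings=True)

# Cosine similarity of every source sentence with every target sentence.
sim = src_emb @ tgt_emb.T

mined = []
for i in range(len(src_sentences)):
    j = int(np.argmax(sim[i]))
    if sim[i, j] > 0.7:                       # illustrative threshold
        mined.append((src_sentences[i], tgt_sentences[j], float(sim[i, j])))

print(mined)
```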
Semi-supervised methods have achieved good results in mining parallel corpus resources, but they also have limitations: the quality of the indirectly used parallel corpora cannot be guaranteed, which causes serious error propagation; as erroneous data accumulates, its impact on the final performance of the neural machine translation model may be amplified.
We conducted research on neural machine translation based on pivot languages, drawing on traditional pivot machine translation methods and combining them with dual translation models and model fusion techniques. Compared with traditional pivot translation, which requires multiple translation passes, our goal is to minimize the errors that accumulate across those passes. Experiments on translation tasks involving Estonian, Latvian, Romanian, and Chinese showed that the proposed method significantly outperforms traditional iterative translation methods.
2.3 Unsupervised Methods in Neural Machine Translation for Low-Resource Languages
Unsupervised methods in neural machine translation for low-resource languages refer to methods that do not use bilingual sentence pairs during the entire model training process, relying solely on monolingual data from the source and target languages to build neural machine translation models.
Unsupervised neural machine translation methods build neural machine translation models using only monolingual data from both languages. They generally involve three steps: first, training cross-lingual word embeddings on large-scale monolingual data and using them to initialize a translation model from the source language to the target language; second, training denoising autoencoders on large-scale monolingual data as language models for the source and target languages separately; and third, using back translation to turn the unsupervised translation problem into a supervised one and iterating this process several times. The introduction of unsupervised neural machine translation caused a stir in the field, overturning the traditional assumption that machine translation training must depend on parallel corpora, and it has achieved good performance on language pairs that are relatively similar (e.g., English and German).
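The training loop of such a system alternates between the denoising and back-translation objectives; a schematic sketch under strong simplifying assumptions follows. The `noisify` function is concrete (word dropout plus local shuffling), while the shared `translate` model and the `train_step` update are hypothetical stubs that a real encoder-decoder would replace.

```python
import random

def noisify(tokens, p_drop=0.1, window=3, seed=None):
    """Noise model for the denoising autoencoder: word dropout + local shuffle."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p_drop]
    keys = [i + rng.uniform(0, window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

# Hypothetical placeholders standing in for the shared encoder-decoder model.
def translate(tokens, direction):            # direction in {"s2t", "t2s"}
    return tokens                            # identity stub; a real model goes here

def train_step(inp, out, direction):
    pass                                     # one gradient update, omitted

mono_src = ["see on lause".split()]
mono_tgt = ["this is a sentence".split()]

for epoch in range(3):
    for s in mono_src:
        train_step(noisify(s), s, "s2s")            # denoising autoencoder, source side
        train_step(translate(s, "s2t"), s, "t2s")   # back-translation: pseudo target -> real source
    for t in mono_tgt:
        train_step(noisify(t), t, "t2t")            # denoising autoencoder, target side
        train_step(translate(t, "t2s"), t, "s2t")   # back-translation: pseudo source -> real target
```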
Unsupervised neural machine translation has to some extent changed how people think about machine translation research and has achieved good results in zero-resource translation tasks, but experiments show that the method only performs well on pairs of similar languages, while its performance on distant language pairs remains poor.
We improved the translation model based on the latest proposed unsupervised neural machine translation model. Specifically, we first trained a multilingual sentence embedding model on a large-scale Wikipedia corpus to mine parallel corpora; then, we extracted bilingual dictionaries from the mined parallel corpora to guide the generation of cross-lingual word embeddings as supervisory signals; finally, we initialized the unsupervised neural machine translation model with the trained cross-lingual word embeddings. Experimental results from translation tasks involving Arabic, Russian, Portuguese, Hindi, and Chinese indicated that the proposed method could enhance the performance of unsupervised cross-lingual word embeddings, but due to significant differences between these languages and Chinese, the final improvement in unsupervised neural machine translation was only marginal.
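The supervision step described here, using a bilingual dictionary extracted from mined data to guide cross-lingual word embeddings, can be illustrated with the classical orthogonal Procrustes solution: given embedding pairs for the dictionary entries, the optimal orthogonal mapping from source space to target space is obtained from an SVD. The random embeddings below are placeholders for real word vectors, and this is a generic technique rather than the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings for the words in the extracted bilingual dictionary:
# row i of X (source space) translates to row i of Y (target space).
d, n = 50, 200
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, d))

# Orthogonal Procrustes: W = argmin ||X W - Y||_F subject to W^T W = I,
# solved via the SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Map any source-language embedding into the target space with X @ W.
mapped = X @ W
print(np.linalg.norm(mapped - Y) / np.linalg.norm(Y))
```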
3 Conclusion
Currently, neural machine translation for low-resource languages is a hot research topic in the field of machine translation, with considerable room for future development. Existing methods categorize low-resource language neural machine translation into supervised, semi-supervised, and unsupervised methods based on corpus usage. Of course, in addition to categorizing based on corpus usage, exploring new methods and models in other low-resource contexts (such as multi-domain low-resource, low-resource morphologically rich languages, distant low-resource language pairs, and low-resource multilingual scenarios) is also a worthwhile task for in-depth research.
(References omitted)
Selected from “Chinese Association for Artificial Intelligence Communications”
2022, Vol. 12, No. 3
Special Issue on Multilingual Intelligent Information Processing
