Abstract: This article summarizes the basic principles and classifications of machine translation.
Keywords: Machine Translation; Working Principles; Classification
Machine Translation (MT) is a comprehensive discipline that draws on multiple fields. The development of modern theoretical linguistics, advances in computer science, and the application of information science and probability and statistics have all significantly influenced the development and evolution of machine translation. The fundamental idea of machine translation is to use computers to translate natural languages, and the various machine translation systems employ different technologies and concepts; accordingly, there are various ways of classifying these diverse systems. This article provides an overview of machine translation systems based on their fundamental working principles.
1. Basic Types of Machine Translation Systems: Existing machine translation systems can be divided into three basic types according to their fundamental working principles: rule-based machine translation, example-based machine translation, and statistical machine translation.
1.1. Rule-Based Machine Translation Systems (RBMT): The fundamental working principle rests on the assumption that the infinitely many sentences of a language can be derived from a finite set of rules. Machine translation methods based on this assumption can be further divided into three categories: direct translation, the interlingual approach, and the transfer approach. All three require large bilingual dictionaries together with rules for analyzing the source language, transforming between languages, and generating the target language; they differ in the depth of linguistic analysis performed. For example, direct translation requires almost no linguistic analysis, while the interlingual and transfer approaches require some degree of analysis of both the source and target languages.
1.1.1. Direct Translation: This method translates each word in the source text individually and arranges the translations in the same order as the original. It was the earliest working method of rule-based machine translation. The method is simple and intuitive, but its drawbacks are equally evident: the quality of the resulting translations is often unsatisfactory, and this direct method has gradually fallen out of use.
1.1.2. Interlingual Approach: This method conducts thorough linguistic analysis of the source text and converts it into an intermediate-language representation, which is then further transformed into text conforming to the grammar rules of the target language. The intermediate language is an artificial, unambiguous form of expression, not a natural language used by people in any country or region; nor is it unique, since different systems may use different intermediate languages. Theoretically, the interlingual approach is the most efficient method of translation: it requires only 2n modules to handle all mutual translation among n natural languages (one analysis module and one generation module per language), whereas without an intermediate language, mutual translation among these languages requires n(n-1) modules, one for each ordered language pair. When n exceeds 3, 2n is less than n(n-1), and since the number of natural languages in the world far exceeds 3, the saving in modules is substantial.
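The module-count argument above can be checked with a short sketch (the two functions are purely illustrative, not part of any MT system):

```python
def modules_direct(n):
    # direct pairwise translation: one module per ordered language pair
    return n * (n - 1)

def modules_interlingua(n):
    # interlingual approach: one analysis module into the interlingua
    # and one generation module out of it, per language
    return 2 * n

# for n > 3 the interlingual approach needs fewer modules:
# n=3 -> 6 vs 6, n=4 -> 12 vs 8, n=10 -> 90 vs 20
for n in (3, 4, 10):
    print(n, modules_direct(n), modules_interlingua(n))
```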
1.1.3. Transfer Approach: This method first performs a certain degree of linguistic analysis on the source text, abstracting away surface grammatical features to produce an intermediate representation of the source language; this is then transferred into an intermediate representation of the target language, from which output text conforming to the target language's grammar rules is finally generated. At present, the linguistic analysis and implementation of the transfer approach are the most complex of the three methods, but the translation quality obtained is also the best, making it the most commercially successful method.
In many rule-based machine translation systems, linguists help write a series of grammatical rules for the source and target languages, as well as transformation rules for converting source language data into target language data. However, creating these rules entirely by hand is expensive, time-consuming, and error-prone. One solution is to use past translation results as a resource pool, in which source language texts and their corresponding target language translations serve as examples from which appropriate rules can be extracted. One method involves manually tagging the source text and the target language translation to indicate their correspondence. Sato and Nagao[1] developed a system that uses a "flat dependency tree" to represent the source and target language texts; this tree-structured data format is one that computers can process efficiently. Typically, two levels are used to represent the correspondence between the source and target languages: the first depends on the surface form of the text (such as word and phrase order) and is used for analyzing the source language and generating the target language; the second depends on the semantic relationships between words and is used for transferring from the source language to the target language. Such a system brings the advantages of a resource pool to rule-based machine translation.
With the accumulation of a large number of historical translation results, example-based machine translation systems have emerged, utilizing these completed translation results as a resource pool in machine translation.
1.2. Example-Based Machine Translation (EBMT): Its fundamental working principle is analogy: the text fragments most similar to the source text are matched from the resource pool, the corresponding target language translations of those example fragments are extracted and appropriately modified, and the complete translation is assembled from them. The core idea of example-based machine translation was first proposed by Makoto Nagao[2], who observed that when translating simple sentences, people do not conduct deep linguistic analysis but rather translate by analogy: the source sentence is first decomposed into several fragments, each fragment is translated into the target language by matching it against example sentences, and the translated fragments are finally combined into a longer sentence.
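The matching step can be sketched as follows; the toy example library, the similarity measure (difflib's character-level ratio), and the input fragment are all illustrative assumptions, not part of any described system:

```python
from difflib import SequenceMatcher

# toy example library: (source fragment, target translation) pairs
EXAMPLES = [
    ("good morning", "bonjour"),
    ("thank you very much", "merci beaucoup"),
    ("see you tomorrow", "a demain"),
]

def best_match(fragment):
    """Return (similarity, source example, target translation) for the
    stored example most similar to the input fragment."""
    return max((SequenceMatcher(None, fragment, src).ratio(), src, tgt)
               for src, tgt in EXAMPLES)

# the closest example is "thank you very much",
# so its translation is reused (and would then be adapted)
score, src, tgt = best_match("thank you so much")
```

A real EBMT system would use linguistically informed similarity (thesauri, part-of-speech patterns) rather than raw character overlap, but the analogy principle is the same.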
1.2.1. Composition of the Example Library: The example library, also known as the corpus, consists of completed translation results, including human translations and manually post-edited machine translations. The corpus is made up of bilingual pairs: segments of source language text together with the corresponding segments of target language translation. These translation results must first undergo segmentation and alignment before they become usable material, which is why the corpus is also called a parallel (bilingual) corpus. Segmentation and alignment currently take various forms, such as sentence-level and phrase-level alignment; the chosen size of the aligned text segments directly affects matching efficiency and translation quality.
1.2.2.Fragmentation Issues in Corpus Segmentation: Nirenburg et al. (1993) pointed out that in example-based machine translation systems (EBMT), there exists a contradiction between the length of text segments and similarity. The longer the text segment, the less likely it is to achieve a high similarity match; the shorter the text segment, the more likely it is to achieve a rough match, but the risk of obtaining low-quality translation results increases. For instance, issues of overlap caused by paragraph boundary definitions and inappropriate segmentation lead to a decline in translation quality. Intuitively, it seems better to select a corpus segmented by sentence units, which has many advantages such as clear boundaries and straightforward structures for simple sentences. However, in practical applications, segmenting by sentence units is not the most appropriate method. Empirical evidence shows that the matching and reassembly process requires the use of shorter segments[3]. (Of course, these research results are based on translation studies between languages of the European and American language families.)
1.2.3.Customization of the Example Library: The scope and quality of the example corpus significantly affect the translation quality of example-based machine translation systems (EBMT). Obtaining high-quality corpus in a specific field can greatly enhance the translation quality of machine translation in that field, referred to as the customization of the corpus (example) library.
1.3. Statistical Machine Translation Systems (Statistical MT): Brown and colleagues at IBM were the first to apply statistical models to French-English machine translation, in 1990. The fundamental idea is to view machine translation as a noisy-channel problem and to use a channel model for decoding: translation is treated as a decoding process, transforming it into a search for the optimal translation. The focus of machine translation based on this idea is to define the most suitable language model and translation model, and then to estimate their probability parameters. Estimating the language model's parameters requires a large monolingual corpus, while estimating the translation model's parameters requires a large parallel bilingual corpus. The quality of statistical machine translation thus depends largely on the performance of the language model and the translation model; moreover, finding the optimal translation also requires good search algorithms. In short, statistical machine translation first establishes a statistical model, then trains it on examples from the resource pool to obtain the language model and translation model required for translation.
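A minimal sketch of noisy-channel decoding follows; the hand-set probabilities are illustrative assumptions (real systems estimate them from corpora), and the candidate set stands in for what a search algorithm would explore:

```python
import math

# toy language model P(e) and translation model P(f|e);
# these probabilities are illustrative, not trained values
lm = {"the house": 0.6, "house the": 0.1}
tm = {("la maison", "the house"): 0.7,
      ("la maison", "house the"): 0.7}

def decode(f, candidates):
    # noisy-channel decoding: choose e maximizing P(e) * P(f|e),
    # computed here as log P(e) + log P(f|e) for numerical stability
    return max(candidates,
               key=lambda e: math.log(lm[e]) + math.log(tm[(f, e)]))

# both candidates have equal channel probability, so the
# language model breaks the tie in favor of fluent word order
best = decode("la maison", ["the house", "house the"])
```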
In addition to systems based on noisy-channel theory, there are systems based on the maximum entropy method. A. L. Berger proposed the maximum entropy approach to natural language processing in 1996. The German researcher Och and others found that replacing the translation model in IBM's fundamental equation of statistical machine translation with a reverse translation model did not reduce overall translation accuracy, which led them to propose a machine translation model based on the maximum entropy method.
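The standard log-linear form of this maximum-entropy model can be written as follows, where the feature functions h_m and weights λ_m are chosen and tuned per system:

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e}
          \frac{\exp\!\left(\sum_{m=1}^{M} \lambda_m h_m(e, f)\right)}
               {\sum_{e'} \exp\!\left(\sum_{m=1}^{M} \lambda_m h_m(e', f)\right)}
        = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)
```

The noisy-channel model is recovered as the special case with two features, $h_1 = \log P(e)$ and $h_2 = \log P(f \mid e)$, each with weight 1; swapping $h_2$ for the reverse model $\log P(e \mid f)$ is exactly the substitution Och and others investigated.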
Statistical machine translation has achieved certain results; however, purely statistical designs cannot solve all difficulties. Statistical methods do not consider semantic and grammatical factors of language, relying solely on mathematical methods to address language issues, which has significant limitations. Consequently, researchers began exploring the combined application of statistical methods and other translation methods, such as systems that integrate statistical and example-based machine translation, and statistical and rule-based machine translation systems, etc.
2.Comprehensive Types of Machine Translation Systems: The three basic machine translation systems mentioned above each have their advantages and strengths, while also inevitably possessing certain defects and limitations. For instance, rule-based machine translation systems (RBMT) can accurately describe linguistic features and rules, yet formulating applicable and comprehensive language rules is not an easy task; example-based machine translation systems (EBMT) can fully utilize existing translation results, but maintaining the example library requires substantial manual effort and costs; statistical machine translation (Statistical MT) can alleviate the bottleneck issue of knowledge acquisition, but purely mathematical methods struggle to fully address the complexities of language. To further improve the translation level of machine translation systems, researchers have combined the advantages of the above basic types, leading to the invention of hybrid machine translation systems (Hybrid MT), multi-engine machine translation systems (Multi-Engine MT), and theories surrounding knowledge-based machine translation systems (Knowledge-Based MT).
2.1. Hybrid Machine Translation Systems (Hybrid MT): The translation process employs two or more machine translation principles. For example, the core of rule-based machine translation is the construction of a complete and adaptive rule system, and how to obtain such a system has become a research focus. With traditional methods, building a grammar rule library demands significant human and material resources, and conflicts among the many grammar rules are often unavoidable, making completeness and adaptability difficult to guarantee. Meanwhile, ongoing translation work produces a large number of completed translations, forming a substantial corpus. Researchers therefore turned to statistical methods to automatically extract the needed grammatical information from this existing corpus: language transformation rules are extracted from examples, so that example-based machine translation serves as a technique for building the rule base rather than solely for analogy translation, and abstract rules are induced from large numbers of example sentences[4][5]. In this way, the traditional rule-based method evolves into a corpus-supported, rule-based machine translation method; this translation model can be referred to as a hybrid machine translation system (Hybrid MT).
2.2.Multi-Engine Machine Translation Systems (Multi-Engine MT): The basic idea of this machine translation system is to conduct parallel translations using several machine translation engines, each based on different working principles, providing multiple translation results, which are then filtered and synthesized through a mechanism or algorithm to generate the optimal translation result for output. One operational method of multi-engine machine translation systems is as follows: upon receiving the source text, the text is first converted into several text segments, which are then translated in parallel by multiple machine translation engines, with each text segment yielding multiple translation results. These results are then selected through a mechanism to form the optimal combination, ultimately outputting the best translation result[6]. Alternatively, upon receiving the source text, multiple machine translation engines perform parallel translations to yield multiple translation results, which are then compared word by word, selecting appropriate word translations through hypothesis testing and algorithms to form the optimal translation result for output[7].
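One selection mechanism can be sketched as scoring each engine's hypothesis with a target-language model and keeping the best; the engine names, outputs, and bigram list below are all hypothetical:

```python
# hypothetical outputs from three engines for one source sentence
hypotheses = {
    "rbmt": "the cat sat on mat",
    "ebmt": "the cat sat on the mat",
    "smt":  "cat the sat on the mat",
}

# toy "language model": a set of known-good target-language bigrams
KNOWN = {("the", "cat"), ("cat", "sat"), ("sat", "on"),
         ("on", "the"), ("the", "mat")}

def lm_score(sentence, bigrams):
    # count how many adjacent word pairs are known-good bigrams
    words = sentence.split()
    return sum((a, b) in bigrams for a, b in zip(words, words[1:]))

# choose the engine whose hypothesis scores highest
best_engine = max(hypotheses, key=lambda k: lm_score(hypotheses[k], KNOWN))
```

Real multi-engine systems score with trained statistical language models and may recombine fragments from different engines rather than picking one whole hypothesis, but the select-by-score principle is the same.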
2.3.Knowledge-Based Machine Translation Systems (Knowledge-Based MT): In machine translation research, it has become increasingly evident that correctly understanding and comprehending the source language is crucial in the translation process. Language possesses complexity, with its ambiguity being the most stubborn challenge faced by various machine translation systems. Linguistic ambiguity refers to a situation where the same surface structure of language corresponds to two or more deep structures; simply put, one form corresponds to two or more interpretations, which can only be accurately interpreted through contextual cues and the accumulation and processing of world knowledge and common sense. Influenced by developments in artificial intelligence and knowledge engineering, researchers have begun to emphasize a more thorough understanding of the source language, proposing the need for not only deep linguistic analysis but also the accumulation and processing of world knowledge, establishing knowledge bases to aid in understanding language. By comprehending world knowledge, the ambiguity issues encountered in machine translation can be resolved. To fundamentally address the ambiguity of language that machine translation faces, knowledge-based machine translation systems have been proposed.
2.3.1. Semantic Web-Based Machine Translation (SWMT): This is one implementation of knowledge-based machine translation. The Semantic Web refers to technology that transforms existing knowledge content on the web into machine-readable content, making it a "world knowledge base" for machine translation. These theories rest on Tim Berners-Lee's notion that "once knowledge is defined and formalized, it can be accessed in any way." The original design of the World Wide Web aimed for simplicity, decentralization, and ease of interaction, and the web's development has proven a tremendous success; however, the information on the web is directed primarily at the human brain. To enable computers to consume and use these information resources as well, a new technology promoted by the W3C emerged in the new century: the Semantic Web. Its foundational technology is the Resource Description Framework (RDF) data format, which defines a structure for describing, in a natural way, the vast amounts of data processed by computers[8]. Efforts are currently under way to integrate existing machine translation systems with the Semantic Web so as to fully leverage world and expert knowledge and enhance machine translation quality[9].
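For illustration, an RDF statement in Turtle syntax (the identifiers are hypothetical); each statement is a subject-predicate-object triple that machines can process directly:

```turtle
@prefix ex: <http://example.org/> .

# triple: subject ex:Paris, predicate ex:capitalOf, object ex:France
ex:Paris  ex:capitalOf  ex:France .
```

A knowledge base of such triples is the kind of formalized world knowledge a translation system could consult, for example to disambiguate a place name.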
3. Speech Translation: Speech translation is a category of machine translation that stands in contrast to text translation and thus cuts across the classifications above. It nevertheless has widespread applications, such as the automatic translation of spoken content in daily conversations, phone calls, and conference speeches, making it very important in practice. Speech translation adds a speech recognition process before translation to produce accurate text input, and a speech synthesis process after translation to produce accurate spoken output. Both speech recognition and speech synthesis are dedicated fields of research and will not be elaborated here.
Authors: Hong Jie (Transn IOL Technology Co., Ltd., Multilingual Engineering Center); Hong Lei (University of Chinese Academy of Sciences, Department of Foreign Languages)
(Article from the public account “Language Network Fire Cloud Translator”)