1.Basic Types of Machine Translation Systems:
Existing machine translation systems can be classified into three basic types based on their fundamental working principles: Rule-Based machine translation, Example-Based machine translation, and Statistical machine translation.
1.1.Rule-Based Machine Translation Systems (RBMT):
The basic working principle is based on the assumption that an infinite number of sentences in a language can be derived from a finite set of rules.
Based on this assumption, machine translation methods can be further divided into three categories: Direct Translation, Interlingual Approach, and Transfer Approach.
All of these methods require a large bilingual dictionary, source language derivation rules, language transformation rules, and target language generation rules; their differences lie in the depth of linguistic analysis performed.
For instance, the direct translation method requires almost no linguistic analysis, while the interlingual and transfer approaches necessitate a certain degree of linguistic analysis of both the source and target languages.
1.1.1.Direct Translation:
This translation method translates each word in the source text one by one, maintaining the original word order in the translated text.
This was the earliest method used in rule-based machine translation. While the approach is simple and intuitive, its drawbacks are equally evident: the quality of its translations is often unsatisfactory. Consequently, the direct translation method has gradually been abandoned.
1.1.2.Interlingual Approach:
This translation method conducts thorough linguistic analysis of the source language text, transforming it into an intermediate language representation, which is then used to generate and output text that adheres to the grammatical rules of the target language.
This intermediate language is a non-natural language, meaning it is not spoken by people in any country or region; moreover, it is an unambiguous form of expression. Additionally, the intermediate language is not unique; different systems may employ different intermediate languages.
Theoretically, any language can be translated into any other via the intermediate language, which makes this approach the most economical in terms of the number of translation modules required.
Assuming there are n natural languages in the world, the interlingual approach requires only 2n modules to cover all translation directions: one analysis module (into the interlingua) and one generation module (out of it) per language.
Without an intermediate language, n(n-1) modules would be needed, one for each ordered pair of languages. For n greater than 3, 2n is less than n(n-1), and since the number of natural languages in the world far exceeds 3, the interlingual approach requires far fewer modules.
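The module-count comparison above can be checked with a few lines of arithmetic (a simple illustration, not part of any MT system):

```python
# Module counts for n languages, as discussed above.
# Interlingua: one analysis module + one generation module per language.
# Direct pairwise translation: one module per ordered language pair.

def interlingua_modules(n: int) -> int:
    return 2 * n

def pairwise_modules(n: int) -> int:
    return n * (n - 1)

for n in (3, 10, 100):
    print(n, interlingua_modules(n), pairwise_modules(n))
# At n = 3 the counts are equal (6 vs 6); beyond that the gap grows quickly.
```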
1.1.3.Transfer Approach:
This translation method first conducts a certain degree of linguistic analysis on the source language text to eliminate grammatical factors, generating an intermediate representation of the source language. Then, through transformation, it generates an intermediate representation of the target language, which is finally used to produce text that conforms to the grammatical rules of the target language.
Currently, the transfer approach involves the most complex linguistic analysis and implementation of the three methods, but it also produces the highest-quality translations, making it the most commercially successful of the rule-based methods.
In many rule-based machine translation systems, linguists assist in compiling a series of grammatical rules for the source and target languages, as well as transformation rules for converting source language data into target language data.
However, creating these rules entirely by hand is expensive, time-consuming, and error-prone. One solution is to use historical translation results as a resource pool, treating source texts and their corresponding target-language translations as examples from which appropriate rules can be extracted. One method is to manually annotate the source texts and their translations to mark the correspondences between them.
Sato and Nagao developed a system that uses a “flat dependency tree” to represent the source and target language texts. This tree-like data structure is a form that computers can efficiently recognize.
Typically, two levels are used to represent the relationship between the source and target languages: the first level relies on the surface form of the text (such as word and phrase order) for analysis of the source language and generation of the target language; the second level relies on the semantic relationships between words for the transfer from the source language to the target language. Such a system combines a rule-based framework with the advantages of an example library.
As a large number of historical translation results accumulate, instance-based machine translation systems emerge, where completed translation results are used as resource pools for machine translation.
1.2.Example-Based Machine Translation (EBMT):
The basic working principle is based on the principle of analogy, matching the source text fragments with the most similar text fragments from the instance library, extracting the corresponding target language translation results, and making appropriate modifications to ultimately derive a complete translation result.
The core idea of example-based machine translation was first proposed by Makoto Nagao, who observed that people do not perform deep linguistic analysis when translating simple sentences, but rather translate by analogy with examples they have seen before.
First, the source sentence is broken into several fragments, and each fragment is translated into the target language by matching it against example sentences on the principle of analogy; finally, the translated fragments are combined into a complete target sentence.
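The analogy step can be sketched in a few lines. This is a toy illustration only: the example library, the language pair, and the word-overlap similarity measure are all invented for demonstration; real EBMT systems use far richer matching.

```python
# Minimal sketch of translation by analogy (toy data, toy similarity).

# Hypothetical example library: (source fragment, target fragment) pairs.
EXAMPLES = [
    ("good morning", "buenos dias"),
    ("thank you", "gracias"),
    ("see you tomorrow", "hasta manana"),
]

def similarity(a: str, b: str) -> float:
    """Fraction of shared words between two fragments (Jaccard overlap)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

def translate_fragment(fragment: str) -> str:
    """Return the target side of the most similar stored example."""
    best = max(EXAMPLES, key=lambda ex: similarity(fragment, ex[0]))
    return best[1]

print(translate_fragment("good morning"))  # -> "buenos dias"
```

A full system would then recombine the translated fragments and smooth the result, which is the hard part the toy omits.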
1.2.1.Composition of the Instance Library:
The instance library, also known as a corpus, is built from completed translation results. These ready-made translations include both human translations and machine translations that have been post-edited by humans.
The corpus consists of bilingual pairs, including source language text fragments and corresponding target language translation fragments. These translation results must first undergo splitting and alignment processing before they can become usable corpus in the library. Therefore, the corpus is also referred to as a parallel bilingual corpus.
Splitting and alignment currently have various forms, such as sentence-level alignment and phrase-level alignment. The choice of the size of the aligned text fragments directly affects the efficiency of matching and the quality of translation.
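A naive sketch of sentence-level alignment follows. It assumes a perfectly parallel text and pairs sentences 1:1; real aligners (for example, length-based methods) must also handle 1:2 and 2:1 mappings, so this is illustrative only.

```python
# Naive sentence-level alignment of a parallel bilingual text.

def split_sentences(text: str) -> list[str]:
    """Crude sentence splitter: break on periods, drop empty pieces."""
    return [s.strip() for s in text.split(".") if s.strip()]

def align(source: str, target: str) -> list[tuple[str, str]]:
    """Pair sentences 1:1; fails if the texts are not perfectly parallel."""
    src, tgt = split_sentences(source), split_sentences(target)
    if len(src) != len(tgt):
        raise ValueError("texts are not 1:1 parallel")
    return list(zip(src, tgt))

pairs = align("Hello. How are you.", "Bonjour. Comment allez-vous.")
print(pairs)
```

The resulting (source, target) pairs are exactly the bilingual units stored in the parallel corpus described above.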
1.2.2.Fragmentation Issues in Corpus Splitting:
Nirenburg et al. (1993) pointed out that in example-based machine translation systems (EBMT), there exists a contradiction between the length of text fragments and their similarity. Longer text fragments are less likely to yield a high-similarity match; shorter fragments are more likely to achieve a rough match, but the risk of obtaining low-quality translation results increases.
For instance, issues arising from paragraph boundary delineation can lead to overlaps and inappropriate divisions that result in decreased translation quality. Intuitively, it seems that using sentences as the unit of division yields better corpus, offering many advantages such as clear boundary delineation and straightforward structure for simple sentences.
However, in practical applications, using sentences as the unit is not always the most appropriate method. Practice has shown that the matching and recombination process requires the use of smaller fragments. (Of course, these research results are based on translation studies between languages of the European and American language families.)
1.2.3.Customization of Instance Libraries:
The scope and quality of instance corpora significantly impact the translation quality of example-based machine translation systems (EBMT). Acquiring high-quality corpora in specific domains can greatly enhance the translation quality of machine translation in those domains, known as the customization of the corpus (instance) library.
1.3.Statistical Machine Translation Systems:
Peter Brown and colleagues at IBM first applied statistical models to French-English machine translation in 1990. The basic idea is to view machine translation as a noisy-channel problem and to decode with a channel model: the translation process is regarded as decoding, reducing translation to a search for the optimal output.
Based on this idea, statistical machine translation focuses on defining the most suitable language probability model and translation probability model, and then estimating the probability parameters of the language model and translation model.
The estimation of parameters for the language model requires a large amount of monolingual corpus, while the estimation of parameters for the translation model requires a large amount of parallel bilingual corpus. The quality of statistical machine translation largely depends on the performance of the language model and translation model; moreover, good search algorithms are also needed to find the optimal translation.
In simple terms, statistical machine translation first establishes a statistical model and then trains this model using instances from the instance library to obtain the required language and translation models for translation.
In addition to systems based on the noisy-channel model, there are also systems based on the maximum entropy method. Berger et al. proposed the maximum entropy approach to natural language processing in 1996.
The German researcher Franz Josef Och and colleagues found that inverting the translation model in IBM's fundamental equation of statistical machine translation did not reduce overall translation accuracy. On this basis, they proposed a machine translation model based on the maximum entropy method.
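In the maximum-entropy (log-linear) formulation, the best translation maximizes a weighted sum of feature functions, sum_m lambda_m * h_m(e, f); the noisy-channel model is the special case with two features, log P(e) and log P(f | e), each weighted 1. The feature functions and weights below are invented purely for illustration:

```python
# Toy log-linear translation scoring with two invented features.

def log_linear_score(e: str, f: str, features, weights) -> float:
    """Weighted sum of feature functions h_m(e, f)."""
    return sum(w * h(e, f) for h, w in zip(features, weights))

def h_length_ratio(e, f):
    """Toy fluency proxy: prefer candidates of similar length."""
    return -abs(len(e.split()) - len(f.split()))

def h_word_overlap(e, f):
    """Toy adequacy proxy: count shared surface words."""
    return len(set(e.split()) & set(f.split()))

candidates = ["the house", "the big house on the hill"]
best = max(candidates,
           key=lambda e: log_linear_score(e, "la maison",
                                          [h_length_ratio, h_word_overlap],
                                          [1.0, 0.5]))
print(best)  # -> "the house"
```

The practical appeal of this formulation is that arbitrary new features can be added and their weights tuned, rather than being locked into the two fixed models of the noisy-channel equation.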
Statistical machine translation has achieved certain successes; however, purely statistical designs cannot resolve all difficulties. Statistical methods do not consider semantic and grammatical factors of language, relying solely on mathematical methods to address language issues, which presents significant limitations.
As a result, researchers began to explore the combined application of statistical methods and other translation methods, such as combining statistical and example-based machine translation systems, as well as statistical and rule-based machine translation systems, etc.
2.Comprehensive Types of Machine Translation Systems:
The three basic machine translation systems mentioned above each have their advantages and strengths, while inevitably possessing certain defects and limitations.
For example, rule-based machine translation systems (RBMT) can accurately describe linguistic features and rules; however, establishing applicable and comprehensive language rules is not an easy task;
Example-based machine translation systems (EBMT) can fully utilize existing translation results, but maintaining the instance library requires substantial manpower and costs;
Statistical machine translation (SMT) can alleviate the knowledge-acquisition bottleneck, but purely mathematical methods struggle to fully address the complexities of language.
To further enhance the translation quality of machine translation systems, researchers have combined the advantages of the above basic types, leading to the invention of hybrid machine translation systems, multi-engine machine translation systems, and the proposal of knowledge-based machine translation systems.
2.1.Hybrid Machine Translation Systems:
A hybrid system employs two or more machine translation principles in its translation process. For example, the core of rule-based machine translation is constructing a complete and adaptable rule system, and how to obtain such a system has become a focal point of research.
Using traditional methods, establishing a grammar rule library requires substantial human and material resources, and conflicts often arise between numerous linguistic grammar rules, making it difficult to ensure the completeness and adaptability of the rules.
As translation work progresses, a large number of completed translations accumulate, forming a vast corpus. Researchers have therefore considered using statistical methods to automatically extract the necessary grammatical information from existing corpora: language transformation rules are extracted from examples, so that example-based machine translation serves as a technique for building a rule base rather than merely for analogy-based translation.
Through an inductive process, abstract rules are derived from numerous example sentences. In this way, the traditional rule-based method evolves into a model that is rule-based with the corpus as auxiliary support; this translation model can be referred to as a hybrid machine translation system.
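One simple form of such induction can be sketched as follows: given two aligned example pairs whose source (and target) sentences differ in exactly one word, generalize the differing slot into a variable, yielding a reusable transfer template. The data and the single-slot heuristic are invented for illustration; real rule induction is far more elaborate.

```python
# Toy rule induction: generalize one differing word into a slot "X".

def induce_template(pair1, pair2):
    """From two (source, target) example pairs differing in one word on
    each side, produce a (source template, target template) rule."""
    (s1, t1), (s2, t2) = pair1, pair2

    def generalize(a: str, b: str):
        wa, wb = a.split(), b.split()
        if len(wa) != len(wb):
            return None
        diffs = [i for i, (x, y) in enumerate(zip(wa, wb)) if x != y]
        if len(diffs) != 1:
            return None
        wa[diffs[0]] = "X"
        return " ".join(wa)

    src_tpl, tgt_tpl = generalize(s1, s2), generalize(t1, t2)
    return (src_tpl, tgt_tpl) if src_tpl and tgt_tpl else None

rule = induce_template(("the red house", "la maison rouge"),
                       ("the blue house", "la maison bleue"))
print(rule)  # -> ('the X house', 'la maison X')
```

Note that the induced rule also captures the adjective reordering between the two languages, which is exactly the kind of transformation rule a hybrid system hopes to harvest from its corpus.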
2.2.Multi-Engine Machine Translation Systems:
The basic idea of this machine translation system is to conduct parallel translation using several machine translation engines simultaneously. These engines, based on different working principles, produce multiple translation results, which are then filtered through a mechanism or algorithm to generate the optimal translation result for output.
One operational mode of multi-engine machine translation systems involves receiving source text, converting it into several text fragments, and having multiple machine translation engines perform parallel translations. Each text fragment receives multiple translation results, and the optimal translation fragments are selected through a certain mechanism to form the best combination, ultimately outputting the optimal translation result.
Alternatively, upon receiving the source text, multiple machine translation engines conduct parallel translations to yield multiple translation results, which are then compared word by word, using hypothesis testing and algorithms to select appropriate word translations to compose the optimal translation result.
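The selection mechanism can be sketched minimally as follows. The engines here are stubs returning fixed strings, and the scoring mechanism is a simple majority vote over whole outputs; real systems score candidates with language models or hypothesis tests as described above.

```python
# Minimal sketch of multi-engine selection (stub engines, majority vote).
from collections import Counter

def engine_a(src): return "the house is red"   # stub rule-based engine
def engine_b(src): return "the house is red"   # stub example-based engine
def engine_c(src): return "red is the house"   # stub statistical engine

def select(src: str, engines) -> str:
    """Run all engines in parallel (conceptually) and pick the output
    most engines agree on -- a toy stand-in for a real scoring mechanism."""
    candidates = [e(src) for e in engines]
    return Counter(candidates).most_common(1)[0][0]

print(select("la maison est rouge", [engine_a, engine_b, engine_c]))
```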
2.3.Knowledge-Based Machine Translation Systems:
In machine translation research, there is an increasing recognition of the importance of correctly understanding and comprehending the source language during the translation process. Language is inherently complex, and its ambiguity presents one of the most stubborn challenges faced by various machine translation systems.
Language ambiguity refers to the situation where the same surface structure of language corresponds to two or more deep structures. In simple terms, one form can correspond to two or more interpretations, which must be correctly interpreted with the help of contextual clues and a comprehensive background of knowledge and common sense.
Influenced by developments in artificial intelligence and knowledge engineering, researchers have begun to emphasize a more thorough understanding of the source language, proposing that not only deep linguistic analysis is needed, but also the accumulation and processing of world knowledge to establish a knowledge base that aids in understanding language.
By understanding world knowledge, the ambiguity of language encountered in machine translation can be resolved. To fundamentally address the issue of language ambiguity faced by machine translation, researchers have proposed knowledge-based machine translation systems.
2.3.1.Semantic Web-Based Machine Translation:
This is one implementation of knowledge-based machine translation systems. The Semantic Web refers to a technology that transforms existing knowledge content on the web into machine-readable content, becoming the “world knowledge base” for machine translation.
These theories are based on Tim Berners-Lee’s assertion that “once knowledge is defined and formalized, it can be accessed in any way.” The original design of the World Wide Web aimed for simplicity, decentralization, and as much interactivity as possible.
The development of the web has proven a tremendous success. However, information on the web is aimed primarily at human readers. To enable computers to also accept and utilize these information resources, a new technology emerged in the new century, promoted by the World Wide Web Consortium (W3C): the Semantic Web.
The foundational technology of the Semantic Web is the data format "Resource Description Framework" (RDF), which defines a structure for describing, in a natural way, the vast amounts of data processed by computers. Currently, efforts are being made to connect existing machine translation systems to the Semantic Web in order to fully leverage world knowledge and expert knowledge and improve machine translation quality.
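RDF represents knowledge as (subject, predicate, object) triples. The toy in-memory triple store below illustrates the data model only; real systems use serializations such as RDF/XML or Turtle and query with SPARQL, and the `ex:` identifiers here are invented placeholders for full URIs.

```python
# Toy RDF-style triple store with pattern-matching queries.

TRIPLES = {
    ("ex:Paris", "ex:isCapitalOf", "ex:France"),
    ("ex:Paris", "ex:type", "ex:City"),
    ("ex:France", "ex:type", "ex:Country"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern (None acts as a wildcard)."""
    return {(s, p, o) for (s, p, o) in TRIPLES
            if subject in (None, s)
            and predicate in (None, p)
            and obj in (None, o)}

print(query(subject="ex:Paris"))  # everything known about ex:Paris
```

Machine-readable facts of this kind are what an MT system could consult to disambiguate, for example, whether "Paris" denotes a city or a person.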
3.Speech Translation:
Speech translation is a category of machine translation distinct from text translation. It has wide applications, such as the automatic translation of spoken content in daily conversations, telephone calls, and conference speeches, and is very important in practice.
Speech translation adds a speech recognition (Speech Recognition) stage before translation to produce correct textual input, and a speech synthesis (Speech Synthesis) stage after translation to produce correct spoken output. Both speech recognition and speech synthesis are specialized research fields in their own right and will not be elaborated here.
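The three-stage pipeline described above amounts to a simple function composition. The sketch below uses stub stages (the audio bytes, recognized text, and translation table are all placeholders); a real system would plug in ASR, MT, and TTS engines at the marked points.

```python
# Speech translation as a three-stage pipeline (all stages are stubs).

def speech_recognition(audio: bytes) -> str:
    return "hello world"          # stub: a real ASR engine decodes audio

def translate(text: str) -> str:
    # stub MT: a tiny lookup table standing in for any MT system above
    return {"hello world": "bonjour le monde"}.get(text, text)

def speech_synthesis(text: str) -> bytes:
    return text.encode("utf-8")   # stub: a real TTS engine emits audio

def speech_translate(audio: bytes) -> bytes:
    """Recognition -> translation -> synthesis, composed end to end."""
    return speech_synthesis(translate(speech_recognition(audio)))

print(speech_translate(b"\x00\x01"))  # -> b'bonjour le monde'
```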