Legal Issues in Machine Translation: Insights

By Tang Wei

Senior Legal Advisor, Tencent Legal Platform Department

Machine translation (MT) is an interdisciplinary field that combines linguistics, computer science, and automation technology. It refers to the engineering process of using computers to translate one language into another, and its core is the computational processing of bilingual knowledge. From the perspective of artificial intelligence, machine translation simulates how humans receive, understand, and analyze language, and then re-expresses the content in another language. In 2017, the government issued the “13th Five-Year Plan for the Development of National Language and Writing,” which proposed support for the research and development of machine translation technologies and products across different languages. Advances in computer science, linguistics (for example, corpora), and artificial intelligence have also provided strong support for machine translation. Alongside this policy encouragement, machine translation faces legal issues concerning the ownership and use of corpora, the commercialization of voice in speech synthesis, and intellectual property protection.

I. Overview of Machine Translation

(1) Development History of Machine Translation

The development of machine translation technology has roughly gone through four stages: emergence, silence, revival, and development.

In 1946, the American scientist Warren Weaver and the British engineer Andrew Booth proposed the idea of using computers for automatic translation. In 1954, IBM conducted the world’s first machine translation demonstration using a computer. In 1966, the report “Language and Machines,” published by the U.S. Automatic Language Processing Advisory Committee (ALPAC), rejected the feasibility of machine translation, and machine translation research entered a period of silence.

It was not until the mid-1970s, with advances in computer hardware and software, that machine translation research began to revive. Industry developed translation systems such as Weidner and EUROTRA. In 1993, IBM’s development of a word-alignment translation model marked the birth of statistical machine translation. In 2003, Edinburgh University’s Koehn proposed the phrase-based translation model, which was widely adopted.

As artificial intelligence has advanced, research combining artificial intelligence with machine translation has also begun. Last year, Google announced research on a model built solely on the attention mechanism and applied to machine translation, which sparked widespread discussion.

(2) Principles of Machine Translation Implementation

The working principle of machine translation is to first establish dictionaries and grammar rules based on linguistics, and then to encode these rules as computer programs. The core is to simulate the human translation process.

Currently, machine translation systems are mainly divided into two categories: rule-based (based on dictionaries and grammar rules) and corpus-based (which applies statistical models for translation based on annotated corpora, further divided into statistical and example-based machine translation).

Rule-based machine translation has the advantage of a high degree of abstraction in knowledge expression and a strong ability to maintain grammatical structure; however, its disadvantage lies in the fact that grammar and rules are written by humans, making consistency difficult to guarantee, and it lacks the ability to handle non-standard language phenomena.

Corpus-based machine translation has several advantages: it acquires knowledge automatically, its performance improves with large-scale corpus training, and it handles ambiguous language phenomena well. However, it requires a large-scale bilingual parallel corpus for training, and selecting and processing that corpus involves a significant workload.

In practical applications of machine translation, a combination of rule-based and corpus-based methods is generally used. The workflow of a machine translation system typically includes the following steps:

(1) Source text input;

(2) Lexical analysis processing;

(3) Syntactic analysis processing;

(4) Semantic analysis processing;

(5) Encyclopedic knowledge processing;

(6) Target conversion generation;

(7) Translated text output.
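
The seven steps above can be sketched as a chain of functions. This is only a minimal illustration, not a description of any real system: the function names, the toy English-to-French lexicon, and the placeholder stages for steps (3) to (5) are all assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical toy English-to-French lexicon, for illustration only.
TOY_LEXICON: Dict[str, str] = {"the": "le", "cat": "chat", "sleeps": "dort"}


def lexical_analysis(source_text: str) -> List[str]:
    """Step 2: split the source text into lowercase word tokens."""
    return source_text.lower().rstrip(".").split()


def syntactic_analysis(tokens: List[str]) -> List[str]:
    """Step 3: placeholder; a real system would build a parse tree here."""
    return tokens


def semantic_analysis(tokens: List[str]) -> List[str]:
    """Step 4: placeholder; a real system would resolve word senses here."""
    return tokens


def encyclopedic_processing(tokens: List[str]) -> List[str]:
    """Step 5: placeholder; a real system would apply world knowledge here."""
    return tokens


def target_generation(tokens: List[str]) -> str:
    """Step 6: look each token up in the toy lexicon; keep unknown words as-is."""
    return " ".join(TOY_LEXICON.get(token, token) for token in tokens)


def translate(source_text: str) -> str:
    """Steps 1 to 7: run the stages in order, from source input to output."""
    tokens = lexical_analysis(source_text)
    for stage in (syntactic_analysis, semantic_analysis, encyclopedic_processing):
        tokens = stage(tokens)
    return target_generation(tokens)


print(translate("The cat sleeps."))  # prints "le chat dort"
```

Keeping each stage as a separate function mirrors the step list: a rule-based or corpus-based component could replace any single placeholder without changing the rest of the chain.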

In real-time speech translation, speech recognition follows voice input and, correspondingly, speech synthesis precedes the output of the translation. As an applied technology of natural language processing, machine translation involves multiple disciplines and technologies, including artificial intelligence, mathematics, linguistics, computational linguistics, speech recognition, and speech synthesis.

II. Legal Issues in the Development of Machine Translation and Responses

(1) Ownership and Use of Corpus Data

Corpus-based machine translation introduces statistical or example-based corpora into the translation process, transforming the corpus into a language knowledge base through corpus processing techniques. In recent years, corpus-based machine translation systems have developed rapidly and achieved outstanding results.

Corpus-based machine translation requires large-scale, high-quality bilingual parallel corpora for training. The accuracy of the translation model and language parameters depends directly on the amount of corpus data, and translation quality depends largely on the quality of the probabilistic models and on the quality and coverage of the corpus. Translation performance can be improved through large-scale corpus training that integrates more syntactic structure and semantic information.
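
To make the dependence on parallel data concrete, the sketch below estimates word translation probabilities from a tiny parallel corpus using expectation-maximization in the style of IBM Model 1 (without a NULL word). The article does not specify any particular model, so the choice of Model 1, the function names, and the three-sentence German-English corpus are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def train_model1(pairs: List[SentencePair], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t(target word | source word) with IBM Model 1 style EM."""
    target_vocab = {t for _, target in pairs for t in target}
    # Start from a uniform distribution over the target vocabulary.
    t_prob: Dict[Tuple[str, str], float] = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        # E-step: collect expected alignment counts from every sentence pair.
        for source, target in pairs:
            for t in target:
                norm = sum(t_prob[(t, s)] for s in source)
                for s in source:
                    delta = t_prob[(t, s)] / norm
                    count[(t, s)] += delta
                    total[s] += delta
        # M-step: renormalize the expected counts into probabilities.
        for (t, s), c in count.items():
            t_prob[(t, s)] = c / total[s]
    return dict(t_prob)


def best_translation(t_prob: Dict[Tuple[str, str], float], source_word: str) -> str:
    """Return the target word with the highest probability for source_word."""
    candidates = {t: p for (t, s), p in t_prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy parallel corpus (German source, English target).
corpus: List[SentencePair] = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]
model = train_model1(corpus)
print(best_translation(model, "haus"))
```

With only three sentence pairs, the estimates rest entirely on which words co-occur; adding or removing a single pair can change them, which is the sense in which model accuracy depends on corpus size and coverage.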

Moreover, translation memories can save and reuse previously translated texts, ensuring consistency and quality of translations, reducing redundant labor and lowering translation costs. Therefore, in corpus-based machine translation, the construction of the corpus holds a very important position, and the ownership and use of the corpus have become factors restricting the development of machine translation.
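
A translation memory of the kind described above can be sketched as a store of previously translated segments with fuzzy lookup. This is a minimal illustration under stated assumptions: the class name, the 0.8 similarity threshold, and the sample segments are hypothetical, and character-level similarity from Python's standard `difflib` module stands in for whatever matching a production tool would use.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


class TranslationMemory:
    """Stores source/target segment pairs and reuses them for similar input."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # hypothetical similarity cut-off
        self.segments: Dict[str, str] = {}

    def add(self, source: str, target: str) -> None:
        """Save a translated segment for later reuse."""
        self.segments[source] = target

    def lookup(self, source: str) -> Optional[Tuple[str, float]]:
        """Return (stored translation, similarity) for the best match, if any."""
        best: Optional[Tuple[str, float]] = None
        for stored_source, stored_target in self.segments.items():
            score = SequenceMatcher(None, source, stored_source).ratio()
            if score >= self.threshold and (best is None or score > best[1]):
                best = (stored_target, score)
        return best


memory = TranslationMemory()
memory.add("Press the start button.", "Appuyez sur le bouton de démarrage.")

# An identical segment is reused directly; a near match is offered for review.
print(memory.lookup("Press the start button."))
print(memory.lookup("Press the stop button."))
```

Because every stored segment is itself a translated text, the ownership questions raised for corpora apply equally to the contents of such a memory.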

The construction of a corpus requires tremendous labor and effort, especially high-quality corpora that need to be manually selected and filtered in the initial stages. Currently, the construction of corpora in China mainly relies on higher education institutions and research organizations, as well as specialized corpus data companies, and some institutions are collecting and processing corpora through open platforms to enrich the corpus.

The essence of a corpus is data, and the recently published “General Provisions of the Civil Law” confirms the status of data as property in law. Because a corpus is a form of property, its construction requires attention to the legality of its ownership and use. Lawful methods of corpus construction include self-construction, collaborative construction, negotiated transfer, and negotiated authorization; a corpus built through lawful means can, to the greatest extent, avoid defects of title as machine translation develops. Scraping data from the internet, by contrast, essentially constitutes unauthorized use of content that carries legally protected rights, and it may create legal risk.

(2) Rights Related to Voice Commercialization in Speech Synthesis

Speech synthesis is the technology that converts text into speech through mechanical and electronic means, involving the machine’s understanding of natural language, prosody processing, and voice synthesis. In the speech products of machine translation, the translation results are output through speech synthesis technology. To achieve the results of speech synthesis, it is necessary to extract the prosody of a specific person’s voice and control the prosody of the synthesized voice; extract the voiceprint of a specific person and synthesize based on the voiceprint combined with prosody.

Speech synthesis raises the question of whether extracting a specific person’s voiceprint for synthesis requires that person’s consent. This matters especially for public figures, whose voices are highly recognizable and can easily be exploited by others. Some countries and regions have established specific legal protections for voice, such as California and Nevada in the U.S., France, Quebec in Canada, and the Macao region of China. China, however, currently has no explicit legal provision for this so-called “voice right.” Some scholars suggest that it can be protected through general personality rights, or by establishing voice rights, image rights, or commercialization rights; but the issue is still at the stage of theoretical discussion, and no relevant cases have emerged in judicial practice.

Although there is controversy over how voices should be protected, scholars unanimously agree that legal protection is necessary. If a product is to use a celebrity’s voice, authorization should be obtained by contract; if it uses synthesized speech, care should be taken to avoid similarity to a celebrity’s voice.

(3) Establishing Patent Protection and Technical Standards

With industrial development, a general demand has emerged for the internationalization of machine translation technology in language information processing. Applying for patents to protect technological achievements, and then establishing corresponding technical standards, is common practice internationally.

In the field of artificial intelligence, the European Union has proposed establishing unified technical standards within member states to guide the development and application of artificial intelligence, avoiding fragmentation and redundant construction within the EU market. The United States has also issued reports such as “Preparing for the Future of Artificial Intelligence” and “National Artificial Intelligence Research and Development Strategic Plan,” proposing the establishment of unified standards for technology, data usage, and safety to avoid closed construction issues among participants that could affect the development and application of artificial intelligence.

Companies such as Google, Facebook, IBM, Microsoft, and Amazon are attempting to jointly formulate a series of standards for artificial intelligence. Companies leading in machine translation technology have already converted their technologies into patents: hundreds of patented technologies exist in statistical machine translation, and neural machine translation likewise requires significant research results to achieve similar quality. For example, Google has applied for a patent for a neural machine keyword processing translation system.

Establishing unified technical standards, data usage standards, and safety assurance standards will be the direction of future machine translation development. Therefore, while patenting their technologies, companies should take care to avoid patent infringement and should participate in discussions on technical standards.

(4) Protection of Personal Information

According to the “Personal Information Security Specification,” personal information is any information, recorded electronically or otherwise, that can identify a specific natural person alone or in combination with other information, or that reflects the activities of a specific natural person. Corresponding protective measures must be established for such information when it is collected, stored, and used in machine translation, and the basic principles of privacy and data protection must not be violated.

China’s “Cybersecurity Law,” “General Provisions of the Civil Law,” and other laws and regulations contain specific provisions on strengthening the protection of personal information. Network service providers that collect and use users’ personal information while providing services must adhere to the principles of legality, legitimacy, and necessity. They must not collect personal information beyond what is necessary to provide the service, must not use the information for purposes other than providing the service, and must not collect or use users’ personal information through deception, misleading conduct, or coercion.

In industry practice, some translation companies have already drawn public attention over personal information issues. Last year, the Norwegian broadcaster NRK reported a data breach involving the Norwegian National Oil Company (Statoil) that resulted from the use of an online translation tool (translate.com). In providing its translation services, the Translate.com website exposed files to search engine indexing, so that anyone could find them through a public search.

From the perspective of protecting users’ personal information, machine translation services should avoid collecting users’ personal information, as the service content itself does not necessitate the collection of such information. If it is necessary to obtain users’ translation results to enrich the corpus, it can be collected after processing to ensure anonymity, and with user consent.
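
The collect-only-after-anonymization-and-consent approach described above can be sketched as follows. This is an illustrative assumption, not a compliance recipe: the function names and placeholder tags are hypothetical, and the two regular expressions cover only e-mail addresses and phone-like digit runs, far less than what counts as personal information in practice.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately simple patterns; real personal-information
# detection needs much broader coverage than e-mail addresses and phone numbers.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tags."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


def collect_for_corpus(
    source: str,
    translation: str,
    user_consented: bool,
    corpus: List[Tuple[str, str]],
) -> bool:
    """Add an anonymized sentence pair to the corpus only if the user consented."""
    if not user_consented:
        return False  # nothing is stored without consent
    corpus.append((anonymize(source), anonymize(translation)))
    return True


corpus: List[Tuple[str, str]] = []
collect_for_corpus(
    "Contact li@example.com or +86 138-0000-0000.",
    "Contactez li@example.com ou +86 138-0000-0000.",
    user_consented=True,
    corpus=corpus,
)
print(corpus[0][0])  # prints "Contact [EMAIL] or [PHONE]."
```

Checking consent before anonymizing means that text from users who declined is never processed for storage at all.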

Additionally, some have suggested that, as hardware capabilities improve, the module responsible for the translation process could be moved from the server to the client. Performing the translation computation on the client side avoids uploading content to the server and thus reduces the risk of personal information leakage.

III. Conclusion

Amid rapid globalization, machine translation technology is gradually changing the way people work and live. Machine translation can help us overcome language barriers anytime and anywhere, enabling free communication among people who speak different languages. The legal issues facing the development of machine translation are also beginning to emerge, and whether the current legal system can provide the framework that this development needs deserves continued attention.

Editor-in-Chief: Ma Ce

Editor: Asepirin

Cover Image Source: pixabay.com
