Human-Machine Collaboration in Interpretation

Over the past century, information and communication technology has driven successive breakthroughs in interpretation. In the 1920s, the advent of wired voice transmission made simultaneous interpretation possible, letting interpreters listen and translate in real time and revolutionizing traditional working models. In the 1960s and 70s, the emergence of computers and the internet reshaped how interpreters acquired language and knowledge resources, opening the era of computer-assisted interpretation. From the 1980s to the early 21st century, as machine translation revived and developed, machine interpretation gradually became a research hotspot, applied mainly to dialogue scenarios such as scheduling and travel information services. In recent years, rapid advances in big data and artificial intelligence have brought major progress in speech recognition and natural language processing, which, together with disruptive breakthroughs in neural machine translation, have greatly enhanced machine interpretation. Whether machine interpretation will replace human interpretation has become a question of wide concern, making it especially necessary to analyze and compare the processes, capabilities, quality, and effects of human and machine interpretation and to clarify their respective strengths and weaknesses.
Comparison of Processing Processes

Machine interpretation refers to the real-time, automated speech translation of a source language by a computer system. It comprises three modules: speech recognition, machine translation, and speech synthesis, carried out by models trained on corpora through machine learning: a source language acoustic model, a source language language model, and a source language boundary (segmentation) model; a machine translation model and a target language model; and a speech synthesis model. Since these models depend on training on natural discourse and human translation corpora, machine interpretation essentially performs source language recognition, language conversion, and target language output on the basis of fixed usages or translations within a limited context and situation. Human interpretation, by contrast, rests on human intelligence: it integrates knowledge, experience, context, and situational factors and draws on diverse cognitive resources for source language comprehension, information storage (memory, note-taking), language conversion, and target language delivery in a dynamic cognitive process.
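The cascaded architecture just described can be sketched as a pipeline of three stages. The sketch below is a minimal illustration, not any real system's API: the three stages are injected as stand-in callables where a production system would wrap trained acoustic, translation, and synthesis models. It also makes visible a property noted later in this article: an error in an early stage propagates unchecked to the stages downstream.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadedInterpreter:
    """Cascade of the three modules named above: ASR -> MT -> TTS.

    Each stage is an injected callable so the sketch stays model-agnostic;
    real systems would plug in trained models here.
    """
    recognize: Callable[[bytes], str]    # speech recognition: audio -> source text
    translate: Callable[[str], str]      # machine translation: source -> target text
    synthesize: Callable[[str], bytes]   # speech synthesis: target text -> audio

    def interpret(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)       # a recognition error here...
        target_text = self.translate(source_text) # ...is translated verbatim
        return self.synthesize(target_text)

# Toy stand-ins that only demonstrate the data flow (not real models):
demo = CascadedInterpreter(
    recognize=lambda audio: audio.decode("utf-8"),            # pretend audio is text
    translate=lambda text: {"bonjour": "hello"}.get(text, text),
    synthesize=lambda text: text.encode("utf-8"),
)
print(demo.interpret("bonjour".encode("utf-8")))  # b'hello'
```

Because each stage consumes only the previous stage's output, the cascade is exactly the "three-task separation" model whose replacement by end-to-end speech translation the final section identifies as a research direction.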
There are fundamental differences between human and machine interpretation in processing objects, processing levels, processing paths and mechanisms, and strategy application. First, processing objects. Machines process source language speech and the text recognized from it; interpreters process source language speech (including its prosodic information), written or electronic reference materials, scripts, presentation slides, and video materials, as well as visual and auditory cues such as the speaker's expressions, gestures, and body language, and contextual information. Second, processing levels. Machine speech recognition relies heavily on the speech signal, and machine translation depends heavily on textual language structure; both perform mainly shallow, bottom-up processing of linguistic form and semantics and struggle with pragmatic processing. Interpreters combine bottom-up processing driven by the source language with top-down processing driven by knowledge, performing multi-level, in-depth semantic and pragmatic processing. Third, processing paths and mechanisms. Human interpretation involves both vertical and horizontal processing paths. In vertical processing, source language comprehension and target language conversion are independent monolingual systems/processes: the source language passes through phonetic, lexical, syntactic, semantic, and pragmatic analysis to a conceptual representation before target language conversion. In horizontal processing, comprehension and conversion belong to one bilingual system/process, occurring simultaneously without conceptual mediation, with target language counterparts retrieved directly through shared bilingual representations.
Machines rely on static corpus training, and their processing path during translation is singular: horizontal transcoding using bilingual vocabulary and corresponding sentence structures, such as transcription, word-for-word translation, and memorized pairings. Interpreters can both perform horizontal transcoding and draw on various cognitive resources for vertical, deverbalized comprehension and expression (detaching from source language form) through conceptual mediation, carrying out propositionalization and propositional reconstruction. Fourth, strategy application. Machines operate programmatically on their training corpora and models and generally cannot apply strategies. Interpreters can flexibly deploy a range of strategies according to source language characteristics and situational context, including content strategies (omission, addition, paraphrasing), language strategies (splitting, merging, open structures), delivery strategies (waiting, stalling, prosody), and comprehensive strategies (prediction, explicitation, softening, etc.).
Core Competency Comparison

The core competencies of interpretation comprise three abilities: message processing; richness, availability, and flexibility of language resources; and verbal fluency. These correspond to the three phases of the interpretation process (source language listening comprehension, bilingual conversion, and target language delivery) and to three dimensions of interpretation quality (accuracy and completeness of content; correctness, authenticity, clarity, and efficiency of language; and clear, smooth, rhythmic delivery). Interpretation competence additionally includes information storage, communication, strategy application, rapid learning, and stress resistance. The specific advantages and disadvantages of interpreters and machines are as follows.
Interpreters have rich cognitive resources, deep comprehension, and high tolerance for ambiguity; they express themselves efficiently and flexibly, producing highly usable output; they are adept at applying strategies; they can undertake complex interpreting tasks (simultaneous and consecutive interpretation, escort interpreting, etc.); and they bring interpersonal communication skills, identity positioning, and emotional warmth. However, interpreters cost more to employ; both their long-term memory (terminology and other resources) and their short-term/working memory (under high information density or depleted cognitive energy) are limited in capacity and availability; and physiological and psychological factors, source language variables, and workload all constrain their performance.
Machine interpretation has low overall cost, vast storage capacity, high learning efficiency, and fast processing speed; it is fully automated, unconstrained by physiological or psychological factors, and generally unaffected by workload, so its performance is more stable. However, machine interpretation also has shortcomings: it struggles to process prosodic information and multimodal information beyond the source language speech; its speech recognition depends heavily on the speech signal itself, is highly sensitive to audio quality and to how standard and typical the pronunciation is, has low tolerance for ambiguity, and has difficulty with homophones; limited by its training corpora, it struggles to recognize and translate new terms, proper nouns, low-frequency words, complex structures, and colloquial expressions; its translation relies predominantly on transcoding and lacks the ability to process pragmatic information, emotional information, and metaphorical expressions; its synthesized speech or subtitle output lacks the natural prosody needed to foreground information, leaving the target language insufficiently usable and lacking in diversity; its on-site strategic capability is weak; and it can neither undertake complex interpreting tasks nor communicate with speakers and audiences in real time.
Quality and Effect Comparison

Interpretation quality is generally assessed at three levels: the interlingual level (comparison of source and target language), the intralingual level (target language sound, language, logic, etc.), and the instrumental level (target language comprehensibility and usability). Interlingual assessment, also called fidelity assessment, takes a product perspective, evaluating how consistent the target language is with the source language content/meaning (accuracy and completeness of the target language). Intralingual and instrumental assessment take a communication perspective, focusing on the correctness of target language expression and delivery and on its usability for interpretation users (audiences). Factors affecting interpretation quality include source language variables, situational variables, language pair, directionality, and interpreter factors. Source language factors have a significant impact and can be subdivided into phonetic delivery factors (accent, pronunciation, prosody, etc.), temporal delivery factors (speech rate, information density, etc.), language factors (expressiveness, clarity, complexity, etc.), and content factors (degree of specialization, etc.).
The author found through preliminary empirical research that machines are extremely sensitive to phonetic delivery factors, while interpreters are only somewhat sensitive; machines are more sensitive to language factors (formality, normativity, clarity, complexity, flexibility), while interpreters are less so; machines are less sensitive to content factors (level of specialized knowledge, etc.), while interpreters are more so; and machines show no significant sensitivity to temporal delivery factors (speech rate, propositional information density, information component density, etc.), while interpreters are extremely sensitive to them.
Major issues with machine interpretation include: speech or vocabulary recognition errors, syntactic, semantic, or pragmatic recognition errors, translation errors (word-for-word translation), and non-fluent or non-normative expressions that are not filtered, generally caused by insufficient cognitive abilities. In terms of effects, users feel that the machine-generated speech synthesis target language is not yet natural or fluent enough, making it difficult to highlight and convey information through appropriate prosody, affecting the usability of the target language; when subtitles display the target language, audiences may find it challenging to read the translation while watching the speaker or PPT. Meanwhile, interpreter issues manifest as insufficient cognitive abilities and cognitive energy, mainly including difficulties in processing numbers, proper nouns, culturally loaded words, and terminology, as well as difficulties in understanding and converting sentences. In simultaneous interpretation characterized by listening and translating in real-time with time constraints, insufficient cognitive energy of interpreters leads to more common problems of omissions and mistranslations.
Prospects for Human-Machine Collaboration

The comparative analysis above indicates substantial room for complementary cooperation between humans and machines, making human-machine collaborative interpretation (machine-assisted human interpretation and human-assisted machine interpretation) particularly promising.
In machine-assisted human interpretation, machines can exploit their automation and storage advantages to support interpreters, via real-time transcription and/or translation prompts of the source language, in scenarios such as: source speech with typical foreign accents; fast speech with high information density (numbers, terminology, proper nouns, etc.); stable, formulaic target language discourse suited to transcoding, as in political, industry, and technical settings; read-aloud scripts and speeches; and uncommon language pairs. Machines can also assist interpreters in preparing knowledge, language, and terminology before an assignment and in language conversion during interpretation.
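The prompting support described above can be illustrated with a small sketch. The function below is a hypothetical assistant, not a description of any deployed tool: it scans a live transcript for the items the text identifies as machine-promptable, numbers and glossary terms (terminology, proper nouns), and returns prompt lines for an interpreter's console. The glossary format and prompt wording are illustrative assumptions.

```python
import re

def prompt_interpreter(transcript: str, glossary: dict[str, str]) -> list[str]:
    """Return console prompts for an interpreter from a live transcript.

    Flags two categories discussed in the text: numbers (hard to hold in
    working memory at speed) and glossary terms with prepared renderings.
    """
    prompts = []
    # Numbers: digits optionally followed by digit/comma/point characters.
    for number in re.findall(r"\d[\d,.]*", transcript):
        prompts.append(f"NUMBER: {number}")
    # Glossary terms: case-insensitive substring match against prepared terms.
    lowered = transcript.lower()
    for term, rendering in glossary.items():
        if term.lower() in lowered:
            prompts.append(f"TERM: {term} -> {rendering}")
    return prompts

# Example with a one-entry hypothetical glossary:
glossary = {"neural machine translation": "traduction automatique neuronale"}
print(prompt_interpreter(
    "Revenue grew 12.5% after we deployed neural machine translation.",
    glossary,
))
# -> ['NUMBER: 12.5', 'TERM: neural machine translation -> traduction automatique neuronale']
```

A production assistant would of course work on streaming ASR output and rank prompts by urgency; the point here is only that numbers and prepared terminology are mechanically detectable, which is precisely why they are good candidates for machine support.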
In human-assisted machine interpretation, scenarios such as academic and technical industry conferences, ceremonies, and promotional events can rely primarily on machine interpretation, while humans use their advantages in on-site multimodal information processing, ambiguity tolerance, pragmatic understanding, and on-the-spot strategy to intervene with necessary noise reduction, correction, and adjustment: editing/pre-processing machine output, correcting errors in speech decoding, chunk segmentation, and punctuation prediction, and dynamically adjusting target language content and form. Beyond professional practice, the real-time transcription and translation functions of machine interpretation can also support interpretation classroom teaching, simulated conference interpreting, and student self-study, helping teachers give feedback on student performance.
Future research on machine interpretation should focus on simulating the advantageous mechanisms and capabilities of human interpretation, aiming at multi-scenario human-machine collaborative applications. It should shift from processing source language speech alone to multimodal information processing; from language processing to pragmatic information processing; from a focus on speech recognition and machine translation to learning and simulating human cognitive processes and abilities; from simple scenarios and single working modes to customizable complex scenarios and composite working modes; and from theoretical model construction to meeting the real, diverse communication needs of the market. Specifically, the following areas should be strengthened: breaking through the current three-task separation model to achieve end-to-end real-time speech translation; multimodal information processing; high-tolerance (ambiguity-robust) understanding of spontaneous speech; pragmatic information processing in comprehension and conversion; development of high-quality, multi-feature training corpora; large-scale, market-oriented comparisons of human and machine simultaneous interpretation; and exploration of human-machine collaboration paths and mechanisms across scenarios, together with the construction of human-machine collaboration models.
(This article is a phased result of the key project of the National Social Science Fund “Research on the Information Processing Path and Mechanism of Chinese-English Simultaneous Interpretation Based on Large Corpora” (22AYY005))
(The author is an associate professor at the Graduate School of Translation, Beijing Foreign Studies University, and the director of the Research Center for Interpretation Education and Practice.)
