How to Evaluate Machine Translation and Post-Editing Quality?

This article is based on an interview with RWS Senior Solutions Architect Miklós Urbán.

1. Automatic Evaluation Metrics for Machine Translation Quality

Evaluating the quality of machine translation (MT) is crucial for improving it. But what are the best metrics for measuring MT quality?
There are two types of methods for measuring MT quality: human evaluation and automatic evaluation. Comprehensive human evaluation is often the most effective solution, but it is also highly subjective, time-consuming, and expensive. Researchers and practitioners have therefore introduced standardized, automated metrics to measure MT performance, and many studies have shown that the results these metrics produce correlate reasonably well with human judgments.
With the emergence of neural machine translation (NMT), the demand for data-driven methods of quantifying MT quality has been growing. Because NMT output differs significantly from statistical machine translation (SMT) output, researchers are looking for new metrics that assess NMT quality more reliably.

Metric 1: BLEU

The BLEU score was the first evaluation metric to gain wide adoption in the industry; it compares machine translation output with a human reference translation. Assuming a document is translated once by a human and once by a machine, the BLEU value is essentially the proportion of words (and short word sequences) in the machine translation that also appear in the human translation.
When BLEU became popular 10 to 15 years ago, it was considered the closest approximation to human quality evaluation. The method is still widely used despite its well-known limitations: for example, it does not handle synonyms or grammatical variation well, and it is one-sided because it measures only in one direction, from the machine translation to the human reference.
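As a rough illustration of the word-overlap idea described above, here is a minimal Python sketch of clipped unigram precision. It is not full BLEU, which also uses n-grams up to length four and a brevity penalty, and the example sentences are invented.

# A minimal sketch of the idea behind BLEU as described above: the share of
# machine-translated words that also appear in the human reference.
# Real BLEU additionally uses n-grams up to length 4, clipped ("modified")
# precision, and a brevity penalty; the example strings below are invented.
from collections import Counter

def unigram_precision(machine: str, human: str) -> float:
    """Fraction of MT tokens that are also found in the reference (clipped)."""
    mt_counts = Counter(machine.lower().split())
    ref_counts = Counter(human.lower().split())
    # Clip each word's count by how often it occurs in the reference,
    # so repeating a matching word cannot inflate the score.
    overlap = sum(min(count, ref_counts[word]) for word, count in mt_counts.items())
    return overlap / max(sum(mt_counts.values()), 1)

if __name__ == "__main__":
    mt = "the cat is barking loudly"
    ref = "the dog is barking loudly"
    print(f"unigram precision: {unigram_precision(mt, ref):.2f}")  # 4 of 5 words match -> 0.80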

Metric 2: METEOR

The METEOR algorithm is more detailed: it compares machine translation and human translation in both directions (precision and recall) and also considers linguistic features such as word forms. Unlike BLEU, METEOR takes the variability of language into account. In English, ‘ride’ and ‘riding’ count as different words under BLEU, but METEOR treats them as a match because they share the same root.
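A toy sketch of this stem-based matching idea follows. It assumes NLTK's Porter stemmer is available and ignores everything else METEOR does (synonym matching, precision/recall weighting, and word-order penalties).

# A toy illustration of the stem-based matching idea behind METEOR, using
# NLTK's Porter stemmer (assumes nltk is installed). Real METEOR also matches
# synonyms, weights precision and recall, and penalises fragmented word order.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stems_match(word_a: str, word_b: str) -> bool:
    """True if two surface forms reduce to the same stem, e.g. 'ride'/'riding'."""
    return stemmer.stem(word_a.lower()) == stemmer.stem(word_b.lower())

if __name__ == "__main__":
    print(stems_match("ride", "riding"))    # True  -> counted as a match
    print(stems_match("ride", "driving"))   # False -> not a match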

2. Automatic Evaluation Metrics for Post-Editing Quality

An important part of evaluating post-editing is measuring the difference between the machine translation output and the post-edited translation, using metrics that count the number of changes: deletions, substitutions, and insertions of words or characters. A formula then converts this count into a single numerical result.
So, what are the commonly used methods for evaluating post-editing?

Levenshtein Distance Algorithm (Edit Distance Algorithm)

The Levenshtein Distance Algorithm (Edit Distance Algorithm) calculates the difference between the machine translation output and the post-edited translation. For example, if the machine translation output is ‘the cat is barking’ and post-editing changes it to ‘the dog is barking’, the difference value is 6, because changing ‘cat’ to ‘dog’ involves deleting 3 letters and adding 3 letters; dividing this count by the total number of characters then gives a percentage result.
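For reference, here is a minimal sketch of the textbook Levenshtein algorithm in Python. Note that the textbook definition counts a substitution as a single operation, so ‘cat’ to ‘dog’ scores 3, while the deletion-plus-insertion counting described above yields 6; either count can be divided by the segment length to express the edit effort as a percentage.

# A minimal sketch of the classic Levenshtein (edit distance) algorithm using
# dynamic programming. The textbook definition counts a substitution as ONE
# operation, so 'cat' -> 'dog' scores 3; the deletion-plus-insertion counting
# described above would give 6. Either count can be divided by the segment
# length to express the edit effort as a percentage.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

if __name__ == "__main__":
    mt = "the cat is barking"
    pe = "the dog is barking"
    dist = levenshtein(mt, pe)
    print(dist)                                           # 3 (c->d, a->o, t->g)
    print(f"{dist / len(mt):.0%} of characters edited")   # ~17%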

TER (Translation Edit Rate)

The difference between the TER method and the ‘Edit Distance Algorithm’ is that the ‘Edit Distance Algorithm’ counts changes at the character level (additions, deletions, and replacements of individual characters), while the TER method counts the number of edit operations rather than the number of changed characters.
In the example of ‘the cat is barking’ vs. ‘the dog is barking’, the ‘Edit Distance Algorithm’ counts both the 3 characters deleted and the 3 characters added, whereas TER identifies only one replacement: one three-character string is replaced by another, so it counts as a single edit.
Thus, in cases where only one long edit has been made, Levenshtein can overestimate the extent of post-editing: if a single long string is replaced within a long sentence, Levenshtein counts every changed character and cannot distinguish that one edit from rewriting most of the sentence. In such cases, TER is more reliable because its logic aligns better with how post-editors actually work.
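The sketch below illustrates the word-level counting idea: edits are measured over whole words and divided by the segment length. It is only an approximation of TER, which additionally treats moving a block of words (a ‘shift’) as a single edit; that part is omitted here.

# A simplified sketch of the word-level idea behind TER: edits are counted over
# whole words, not characters, and divided by the reference length. Real TER
# additionally allows "shifts" (moving a block of words counts as one edit),
# which this sketch ignores.
def word_edit_rate(machine: str, post_edited: str) -> float:
    """Word-level edit distance divided by the number of post-edited words."""
    mt, pe = machine.lower().split(), post_edited.lower().split()
    # Dynamic programming over words instead of characters.
    prev = list(range(len(pe) + 1))
    for i, mw in enumerate(mt, start=1):
        curr = [i]
        for j, pw in enumerate(pe, start=1):
            cost = 0 if mw == pw else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / max(len(pe), 1)

if __name__ == "__main__":
    # 'cat' -> 'dog' is one word substitution out of four words: 1 / 4 = 25%.
    print(f"{word_edit_rate('the cat is barking', 'the dog is barking'):.0%}")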

3. Quality Assessment of Machine Translation: Human Evaluation or Automatic Evaluation?

The purpose of automated evaluation is to mimic the results of human evaluation. Ultimately, however, automated evaluation can only show the percentage difference between the machine translation and a human translation or post-edited translation.
In contrast, human evaluation can be more nuanced, allowing reviewers to give a more detailed picture of machine translation quality. We typically use the TAUS DQF framework to guide human evaluation, which provides a better understanding of different aspects of language quality, such as accuracy (how well the information is conveyed) and fluency (spelling and grammar). Of these, accuracy is the easier one to approximate with the single number that automated metrics return.
Fluency is harder to measure because it is subjective. However, we can build automated metrics that detect shared sequences of consecutive words, known as n-grams (where ‘n’ is the number of consecutive words). In theory, the longer the word sequences that appear in the same order in both the machine translation and the human translation, the more fluent the machine translation output.
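As a crude illustration of this idea, the sketch below finds the longest run of consecutive words that a machine translation shares with a reference. The example sentences are invented, and this is only a proxy for fluency, not a production metric.

# A rough sketch of the n-gram idea described above: find the longest run of
# consecutive words that the machine translation shares with the reference.
# Longer shared sequences suggest more fluent output; this is only a crude
# proxy, not a production fluency metric.
def longest_shared_ngram(machine: str, reference: str) -> int:
    """Length (in words) of the longest word sequence common to both texts."""
    mt, ref = machine.lower().split(), reference.lower().split()
    ref_ngrams = set()
    for n in range(1, len(ref) + 1):
        for i in range(len(ref) - n + 1):
            ref_ngrams.add(tuple(ref[i:i + n]))
    best = 0
    for n in range(1, len(mt) + 1):
        for i in range(len(mt) - n + 1):
            if tuple(mt[i:i + n]) in ref_ngrams:
                best = max(best, n)
    return best

if __name__ == "__main__":
    mt = "the dog barks loudly in the garden"
    ref = "the dog is barking loudly in the garden"
    print(longest_shared_ngram(mt, ref))  # 4 -> "loudly in the garden"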

-End-

Source: Translation Technology Salon
