1. Background Introduction
With the rapid development of natural language processing technology, machine translation has become an important research field. In recent years, large language models (LLMs) have made significant progress on machine translation tasks. These models usually have hundreds of millions or even hundreds of billions of parameters, enabling them to better understand and generate natural language.
However, there are hundreds or even thousands of large models on the market, each with its own characteristics and applicable scenarios. How should we evaluate their translation performance? There are various possible approaches; this article uses the WMT dataset together with the BLEU scoring mechanism to evaluate the translation capabilities of several large models in a relatively systematic way.
First, let’s briefly supplement some basic knowledge.
The WMT Dataset
The WMT (Workshop on Machine Translation) dataset is a series of benchmark datasets used for machine translation, provided by the annual WMT conference. The WMT conference is an important international conference in the field of machine translation, held annually since 2006, aimed at promoting the development of machine translation technology.
The WMT dataset includes translation data for various language pairs, typically sourced from news articles, parliamentary records, books, and other publicly available text resources. These datasets are widely used for training, evaluating, and comparing different machine translation systems. Some well-known language pairs include English-French, English-German, English-Spanish, etc. WMT provides authoritative benchmark data that we can use to evaluate the translation accuracy of different models.
BLEU Scoring
BLEU (Bilingual Evaluation Understudy) scoring is an automatic evaluation metric used to assess the quality of machine translation output. BLEU was proposed by IBM in 2002 to provide a fast, objective, and low-cost method for evaluating the performance of translation systems. BLEU scoring has become one of the most widely used evaluation standards in the field of machine translation.
BLEU produces a score in the range [0, 1] by computing modified n-gram precision against one or more reference translations, combining the precisions with a geometric mean, and applying a brevity penalty to overly short outputs. A score of 1 indicates a perfect match, meaning the machine translation output is identical to the reference translation. By comparing the scores of different large models on the same data, we can evaluate the differences in their translation capabilities.
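For intuition, here is a small toy example using the sentence_bleu function from NLTK (the package we install later), with unigram weights, which is the same setting used in this article. The sentences are arbitrary and only serve to illustrate how the score reflects token overlap.

from nltk.translate.bleu_score import sentence_bleu

reference = ["the", "cat", "sat", "on", "the", "mat"]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# With unigram weights, 5 of the 6 candidate tokens match the reference, so the score is about 0.83
print(sentence_bleu([reference], candidate, weights=(1,)))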
With this basic knowledge, we can implement the specific evaluation program.
The following program uses the LangChain framework and takes English-to-Chinese translation as an example to walk through the implementation of the evaluation.
2. Implementation Process
Loading the Corpus Dataset
First, install the datasets library:
pip install datasets
Then we implement a DataSetLoader to load the WMT corpus (saved as loader.py). The wmt19 dataset includes a Chinese-English ('zh-en') subset; we treat English as the source language and Chinese as the target language.
from datasets import load_dataset


class DataSetLoader:
    """Dataset Loader"""

    def __init__(self):
        """Initialization method"""
        # Load the English-Chinese dataset
        self.ds = load_dataset('wmt19', 'zh-en')
        print("Loaded [en-zh] dataset successfully")

    def get_origin_content(self, idx: int) -> str:
        """Get original content"""
        return self.ds['train'][idx]['translation']['en']

    def get_ref_trans(self, idx: int) -> str:
        """Get reference translation"""
        return self.ds['train'][idx]['translation']['zh']
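As a quick sanity check, the loader can be used as follows. Note that the wmt19 zh-en training split is large, so the first load may take a while, and the exact content of the first record depends on the split.

loader = DataSetLoader()

# Each record is a dict of the form {'translation': {'en': ..., 'zh': ...}}
print(loader.get_origin_content(0))
print(loader.get_ref_trans(0))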
Calculating BLEU Scores
Next, we define a BleuScoreCalculator component to calculate the BLEU score (saved as bleu.py). Here we can directly use the sentence_bleu function from the nltk.translate package (install it with pip install nltk):
from nltk.translate.bleu_score import sentence_bleu


class BleuScoreCalculator:
    """BLEU Score Calculator"""

    @staticmethod
    def calc_score(references, hypothesis) -> float:
        """Calculate BLEU score"""
        # weights=(1,) uses unigram precision only, which is more forgiving for single sentences
        return sentence_bleu(references, hypothesis, weights=(1,))
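A quick self-check: an identical hypothesis and reference should score 1.0 (the tokens here are arbitrary).

calculator = BleuScoreCalculator()

tokens = ["今天", "天气", "很", "好"]
print(calculator.calc_score([tokens], tokens))  # 1.0, a perfect match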
Tokenization Processing
In addition, to avoid the impact of different tokenization rules, we build a Tokenizer component (saved as tokenizer.py) that tokenizes all text with the same rule. For Chinese word segmentation we use the widely adopted jieba library (pip install jieba):
from typing import List

import jieba


class Tokenizer:
    """Tokenizer"""

    @staticmethod
    def clean_and_tokenize(text: str) -> List[str]:
        """
        Clean text and tokenize
        :param text: Original text
        :return: Token list
        """
        # Replace line breaks, collapse double spaces, and trim
        trimmed = text.replace('\n', ' ').replace('  ', ' ').strip()
        # Use jieba for tokenization
        return list(jieba.cut(trimmed))
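For example (the exact segmentation depends on jieba's dictionary and version):

tokenizer = Tokenizer()

print(tokenizer.clean_and_tokenize("今天天气很好。"))
# Possible output: ['今天', '天气', '很', '好', '。']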
Translation Evaluation
All the basic components are ready, so we can now implement the core translation evaluation. We use the LangChain framework to construct a standardized processing chain for each model.
We use the following three candidate models:
- glm-4-plus
- gpt-4o
- qwen-32b
These three are currently very powerful models in the industry. So how do their translation capabilities compare? Let’s write some code to find out:
import json
from typing import List, Dict, Any

import dotenv
from langchain_community.chat_models import ChatZhipuAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from bleu import BleuScoreCalculator
from loader import DataSetLoader
from tokenizer import Tokenizer

if __name__ == '__main__':
    # Load environment variables (API keys, etc.) from a .env file
    dotenv.load_dotenv()

    # Compare 3 LLMs: glm-4-plus, gpt-4o and qwen-32b
    chat_glm_4_plus = ChatZhipuAI(model="glm-4-plus", temperature=0.1)
    chat_gpt_4o = ChatOpenAI(model="gpt-4o", temperature=0.1)
    # qwen-32b is accessed here through an OpenAI-compatible endpoint;
    # configure the corresponding base URL and API key in your environment
    chat_qwen_32b = ChatOpenAI(model="qwen-32b", temperature=0.1)

    # Construct prompt
    query = """Content to be translated: {content}

Source language: {origin_lang}

Target language: {target_lang}

Please note: Generate the translated text directly, no additional information needed!"""
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a translation expert, please translate the text as per user needs"),
        ("human", query)
    ])
    prompt = prompt.partial(origin_lang="English", target_lang="Chinese")

    # Construct chains
    glm_4_plus_chain = prompt | chat_glm_4_plus | StrOutputParser()
    gpt_4o_chain = prompt | chat_gpt_4o | StrOutputParser()
    qwen_32b_chain = prompt | chat_qwen_32b | StrOutputParser()

    print("Translation evaluation starts\n\n")

    # Create dataset loader
    loader = DataSetLoader()
    # Create tokenizer
    tokenizer = Tokenizer()
    # Create BLEU score calculator
    calculator = BleuScoreCalculator()

    count = 20  # Evaluate 20 data points, can be adjusted as needed

    glm_4_plus_total_score: float = 0
    gpt_4o_total_score: float = 0
    qwen_32b_total_score: float = 0

    result: List[Dict[str, Any]] = []

    for i in range(count):
        # Execute translation
        print(f"\n==========Group {i + 1}==========\n")
        origin = loader.get_origin_content(i)
        print(f"[Original Content]: {origin}\n")
        ref_trans = loader.get_ref_trans(i)
        print(f"[Reference Translation]: {ref_trans}\n")

        glm_4_plus_trans = glm_4_plus_chain.invoke({"content": origin})
        print(f"[glm-4-plus Translation Result]: {glm_4_plus_trans}\n")
        gpt_4o_trans = gpt_4o_chain.invoke({"content": origin})
        print(f"[gpt-4o Translation Result]: {gpt_4o_trans}\n")
        qwen_32b_trans = qwen_32b_chain.invoke({"content": origin})
        print(f"[qwen-32b Translation Result]: {qwen_32b_trans}\n")

        # Tokenization processing
        ref_tokens = tokenizer.clean_and_tokenize(ref_trans)
        glm_4_plus_trans_tokens = tokenizer.clean_and_tokenize(glm_4_plus_trans)
        gpt_4o_trans_tokens = tokenizer.clean_and_tokenize(gpt_4o_trans)
        qwen_32b_trans_tokens = tokenizer.clean_and_tokenize(qwen_32b_trans)

        # Calculate BLEU scores
        glm_4_plus_trans_score = calculator.calc_score([ref_tokens], glm_4_plus_trans_tokens)
        print(f"[glm-4-plus BLEU Score]: {glm_4_plus_trans_score}\n")
        gpt_4o_trans_score = calculator.calc_score([ref_tokens], gpt_4o_trans_tokens)
        print(f"[gpt-4o BLEU Score]: {gpt_4o_trans_score}\n")
        qwen_32b_trans_score = calculator.calc_score([ref_tokens], qwen_32b_trans_tokens)
        print(f"[qwen-32b BLEU Score]: {qwen_32b_trans_score}\n")

        glm_4_plus_total_score += glm_4_plus_trans_score
        gpt_4o_total_score += gpt_4o_trans_score
        qwen_32b_total_score += qwen_32b_trans_score

        # Save results
        single_result = {
            "origin": origin,
            "ref_trans": ref_trans,
            "glm_4_plus_trans": glm_4_plus_trans,
            "gpt_4o_trans": gpt_4o_trans,
            "qwen_32b_trans": qwen_32b_trans,
            "glm_4_plus_trans_score": glm_4_plus_trans_score,
            "gpt_4o_trans_score": gpt_4o_trans_score,
            "qwen_32b_trans_score": qwen_32b_trans_score,
        }
        result.append(single_result)
        print(f"\n{json.dumps(result)}\n")

    print("Translation evaluation completed\n\n")

    # Save results
    with open("./trans_result.json", "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=4)

    print("[glm-4-plus Average BLEU Score]: ", glm_4_plus_total_score / count)
    print("[gpt-4o Average BLEU Score]: ", gpt_4o_total_score / count)
    print("[qwen-32b Average BLEU Score]: ", qwen_32b_total_score / count)
3. Conclusion
We tested 20 data points, and the final results are as follows:
[glm-4-plus Average BLEU Score]:  0.6133968696381211
[gpt-4o Average BLEU Score]:  0.5818961018843368
[qwen-32b Average BLEU Score]:  0.580947364126585
[{ "origin": "For geo-strategists, however, the year that naturally comes to mind, in both politics and economics, is 1989.", "ref_trans": "然而,作为地域战略学家,无论是从政治意义还是从经济意义上,让我自然想到的年份是1989年。", "glm_4_plus_trans": "对于地缘战略家来说,无论是在政治还是经济上,自然而然会想到的年份是1989年。", "gpt_4o_trans": "对于地缘战略家来说,无论在政治还是经济方面,自然而然想到的年份是1989年。", "qwen_32b_trans": "然而,对于地缘战略家来说,无论是政治还是经济,自然想到的一年是1989年。", "glm_4_plus_trans_score": 0.5009848620501905, "gpt_4o_trans_score": 0.42281285383122796, "qwen_32b_trans_score": 0.528516067289035}]
It can be seen that for English-to-Chinese translation, the three models' BLEU scores are quite close, all above 0.5, which can be considered good translation quality: the basic meaning of the original text is conveyed, with few errors and reasonable fluency. glm-4-plus scores highest, likely because Zhipu AI has done extensive fine-tuning and optimization on Chinese corpora; data is a crucial factor in machine translation.
This article evaluated only 20 data points, so the results may carry some bias; different test data will also affect the outcome, and the parameters can be adjusted for specific business scenarios. What matters is that this provides a relatively objective way to compare the translation quality of different large models, which can serve as a solid basis for business applications and technology selection.