1. Background Introduction
With the rapid development of natural language processing technology, machine translation has become an important research field. In recent years, large language models (LLMs) have made significant progress on machine translation tasks. These models usually have hundreds of millions or even hundreds of billions of parameters, enabling them to better understand and generate natural language.
However, there are hundreds or even thousands of large models on the market, each with its own characteristics and applicable scenarios. How should we evaluate their translation performance? There are various possible approaches; this article uses the WMT dataset together with the BLEU scoring mechanism to evaluate the translation capabilities of several large models in a relatively systematic way.
First, let’s briefly supplement some basic knowledge.
The WMT Dataset
The WMT (Workshop on Machine Translation) dataset is a series of benchmark datasets used for machine translation, provided by the annual WMT conference. The WMT conference is an important international conference in the field of machine translation, held annually since 2006, aimed at promoting the development of machine translation technology.
The WMT dataset includes translation data for various language pairs, typically sourced from news articles, parliamentary records, books, and other publicly available text resources. These datasets are widely used for training, evaluating, and comparing different machine translation systems. Some well-known language pairs include English-French, English-German, English-Spanish, etc. WMT provides authoritative benchmark data that we can use to evaluate the translation accuracy of different models.
BLEU Scoring
BLEU (Bilingual Evaluation Understudy) scoring is an automatic evaluation metric used to assess the quality of machine translation output. BLEU was proposed by IBM in 2002 to provide a fast, objective, and low-cost method for evaluating the performance of translation systems. BLEU scoring has become one of the most widely used evaluation standards in the field of machine translation.
BLEU produces a score in the range [0, 1] by computing modified n-gram precision against one or more reference translations, combining the precisions with a geometric mean, and applying a brevity penalty to overly short outputs. A score of 1 indicates a perfect match, meaning the machine translation output is identical to the reference translation. By comparing the scores of different large models on the same data, we can evaluate the differences in their translation capabilities.
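For intuition, here is a small toy example using the sentence_bleu function from NLTK (the package we install later), with unigram weights, which is the same setting used in this article. The sentences are arbitrary and only serve to illustrate how the score reflects token overlap.

from nltk.translate.bleu_score import sentence_bleu

reference = ["the", "cat", "sat", "on", "the", "mat"]
candidate = ["the", "cat", "is", "on", "the", "mat"]

# With unigram weights, 5 of the 6 candidate tokens match the reference, so the score is about 0.83
print(sentence_bleu([reference], candidate, weights=(1,)))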
With this basic knowledge, we can implement the specific evaluation program.
The following program uses the LangChain framework and takes English-to-Chinese translation as an example to walk through the implementation of the evaluation.
2. Implementation Process
Loading the Corpus Dataset
First, install the datasets library:
pip install datasets
Then we implement a DataSetLoader to load the WMT corpus (saved as loader.py). The wmt19 dataset includes a Chinese-English ('zh-en') subset; we treat English as the source language and Chinese as the target language.
from datasets import load_dataset


class DataSetLoader:
    """Dataset Loader"""

    def __init__(self):
        """Initialization method"""
        # Load the English-Chinese dataset
        self.ds = load_dataset('wmt19', 'zh-en')
        print("Loaded [en-zh] dataset successfully")

    def get_origin_content(self, idx: int) -> str:
        """Get original content"""
        return self.ds['train'][idx]['translation']['en']

    def get_ref_trans(self, idx: int) -> str:
        """Get reference translation"""
        return self.ds['train'][idx]['translation']['zh']
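As a quick sanity check, the loader can be used as follows. Note that the wmt19 zh-en training split is large, so the first load may take a while, and the exact content of the first record depends on the split.

loader = DataSetLoader()

# Each record is a dict of the form {'translation': {'en': ..., 'zh': ...}}
print(loader.get_origin_content(0))
print(loader.get_ref_trans(0))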
Calculating BLEU Scores
Next, we define a BleuScoreCalculator component to calculate the BLEU score (saved as bleu.py). Here we can directly use the sentence_bleu function from the nltk.translate package (install it with pip install nltk):
from nltk.translate.bleu_score import sentence_bleu


class BleuScoreCalculator:
    """BLEU Score Calculator"""

    @staticmethod
    def calc_score(references, hypothesis) -> float:
        """Calculate BLEU score"""
        # weights=(1,) uses unigram precision only, which is more forgiving for single sentences
        return sentence_bleu(references, hypothesis, weights=(1,))
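A quick self-check: an identical hypothesis and reference should score 1.0 (the tokens here are arbitrary).

calculator = BleuScoreCalculator()

tokens = ["今天", "天气", "很", "好"]
print(calculator.calc_score([tokens], tokens))  # 1.0, a perfect match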
Tokenization Processing
In addition, to avoid the impact of different tokenization rules, we build a Tokenizer component (saved as tokenizer.py) that tokenizes all text with the same rule. For Chinese word segmentation we use the widely adopted jieba library (pip install jieba):
from typing import List

import jieba


class Tokenizer:
    """Tokenizer"""

    @staticmethod
    def clean_and_tokenize(text: str) -> List[str]:
        """
        Clean text and tokenize
        :param text: Original text
        :return: Token list
        """
        # Replace line breaks, collapse double spaces, and trim
        trimmed = text.replace('\n', ' ').replace('  ', ' ').strip()
        # Use jieba for tokenization
        return list(jieba.cut(trimmed))
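For example (the exact segmentation depends on jieba's dictionary and version):

tokenizer = Tokenizer()

print(tokenizer.clean_and_tokenize("今天天气很好。"))
# Possible output: ['今天', '天气', '很', '好', '。']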
Translation Evaluation
All the basic components are ready, so we can now implement the core translation evaluation. We use the LangChain framework to construct a standardized processing chain for each model.
We use the following three candidate models:
- glm-4-plus
- gpt-4o
- qwen-32b
These three are currently very powerful models in the industry. So how do their translation capabilities compare? Let’s write some code to find out:
import json
from typing import List, Dict, Any

import dotenv
from langchain_community.chat_models import ChatZhipuAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from bleu import BleuScoreCalculator
from loader import DataSetLoader
from tokenizer import Tokenizer

if __name__ == '__main__':
    # Load environment variables (API keys, etc.) from a .env file
    dotenv.load_dotenv()

    # Compare 3 LLMs: glm-4-plus, gpt-4o and qwen-32b
    chat_glm_4_plus = ChatZhipuAI(model="glm-4-plus", temperature=0.1)
    chat_gpt_4o = ChatOpenAI(model="gpt-4o", temperature=0.1)
    # qwen-32b is accessed here through an OpenAI-compatible endpoint;
    # configure the corresponding base URL and API key in your environment
    chat_qwen_32b = ChatOpenAI(model="qwen-32b", temperature=0.1)

    # Construct prompt
    query = """Content to be translated: {content}

Source language: {origin_lang}

Target language: {target_lang}

Please note: Generate the translated text directly, no additional information needed!"""
    prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a translation expert, please translate the text as per user needs"),
        ("human", query)
    ])
    prompt = prompt.partial(origin_lang="English", target_lang="Chinese")

    # Construct chains
    glm_4_plus_chain = prompt | chat_glm_4_plus | StrOutputParser()
    gpt_4o_chain = prompt | chat_gpt_4o | StrOutputParser()
    qwen_32b_chain = prompt | chat_qwen_32b | StrOutputParser()

    print("Translation evaluation starts\n\n")

    # Create dataset loader
    loader = DataSetLoader()
    # Create tokenizer
    tokenizer = Tokenizer()
    # Create BLEU score calculator
    calculator = BleuScoreCalculator()

    count = 20  # Evaluate 20 data points, can be adjusted as needed

    glm_4_plus_total_score: float = 0
    gpt_4o_total_score: float = 0
    qwen_32b_total_score: float = 0

    result: List[Dict[str, Any]] = []

    for i in range(count):
        # Execute translation
        print(f"\n==========Group {i + 1}==========\n")
        origin = loader.get_origin_content(i)
        print(f"[Original Content]: {origin}\n")
        ref_trans = loader.get_ref_trans(i)
        print(f"[Reference Translation]: {ref_trans}\n")

        glm_4_plus_trans = glm_4_plus_chain.invoke({"content": origin})
        print(f"[glm-4-plus Translation Result]: {glm_4_plus_trans}\n")
        gpt_4o_trans = gpt_4o_chain.invoke({"content": origin})
        print(f"[gpt-4o Translation Result]: {gpt_4o_trans}\n")
        qwen_32b_trans = qwen_32b_chain.invoke({"content": origin})
        print(f"[qwen-32b Translation Result]: {qwen_32b_trans}\n")

        # Tokenization processing
        ref_tokens = tokenizer.clean_and_tokenize(ref_trans)
        glm_4_plus_trans_tokens = tokenizer.clean_and_tokenize(glm_4_plus_trans)
        gpt_4o_trans_tokens = tokenizer.clean_and_tokenize(gpt_4o_trans)
        qwen_32b_trans_tokens = tokenizer.clean_and_tokenize(qwen_32b_trans)

        # Calculate BLEU scores
        glm_4_plus_trans_score = calculator.calc_score([ref_tokens], glm_4_plus_trans_tokens)
        print(f"[glm-4-plus BLEU Score]: {glm_4_plus_trans_score}\n")
        gpt_4o_trans_score = calculator.calc_score([ref_tokens], gpt_4o_trans_tokens)
        print(f"[gpt-4o BLEU Score]: {gpt_4o_trans_score}\n")
        qwen_32b_trans_score = calculator.calc_score([ref_tokens], qwen_32b_trans_tokens)
        print(f"[qwen-32b BLEU Score]: {qwen_32b_trans_score}\n")

        glm_4_plus_total_score += glm_4_plus_trans_score
        gpt_4o_total_score += gpt_4o_trans_score
        qwen_32b_total_score += qwen_32b_trans_score

        # Save results
        single_result = {
            "origin": origin,
            "ref_trans": ref_trans,
            "glm_4_plus_trans": glm_4_plus_trans,
            "gpt_4o_trans": gpt_4o_trans,
            "qwen_32b_trans": qwen_32b_trans,
            "glm_4_plus_trans_score": glm_4_plus_trans_score,
            "gpt_4o_trans_score": gpt_4o_trans_score,
            "qwen_32b_trans_score": qwen_32b_trans_score,
        }
        result.append(single_result)
        print(f"\n{json.dumps(result)}\n")

    print("Translation evaluation completed\n\n")

    # Save results
    with open("./trans_result.json", "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=4)

    print("[glm-4-plus Average BLEU Score]: ", glm_4_plus_total_score / count)
    print("[gpt-4o Average BLEU Score]: ", gpt_4o_total_score / count)
    print("[qwen-32b Average BLEU Score]: ", qwen_32b_total_score / count)
3. Conclusion
We tested 20 data points, and the final results are as follows:
[glm-4-plus Average BLEU Score]:  0.6133968696381211
[gpt-4o Average BLEU Score]:  0.5818961018843368
[qwen-32b Average BLEU Score]:  0.580947364126585
[{ "origin": "For geo-strategists, however, the year that naturally comes to mind, in both politics and economics, is 1989.", "ref_trans": "然而,作为地域战略学家,无论是从政治意义还是从经济意义上,让我自然想到的年份是1989年。", "glm_4_plus_trans": "对于地缘战略家来说,无论是在政治还是经济上,自然而然会想到的年份是1989年。", "gpt_4o_trans": "对于地缘战略家来说,无论在政治还是经济方面,自然而然想到的年份是1989年。", "qwen_32b_trans": "然而,对于地缘战略家来说,无论是政治还是经济,自然想到的一年是1989年。", "glm_4_plus_trans_score": 0.5009848620501905, "gpt_4o_trans_score": 0.42281285383122796, "qwen_32b_trans_score": 0.528516067289035}]
It can be seen that for English-to-Chinese translation, the three models' BLEU scores are quite close, all above 0.5, which can be considered good translation quality: the basic meaning of the original text is conveyed, with few errors and reasonable fluency. glm-4-plus scores highest, likely because Zhipu AI has done extensive fine-tuning and optimization on Chinese corpora; data is a crucial factor in machine translation.
This article evaluated only 20 data points, so the results may carry some bias; different test data will also affect the outcome, and the parameters can be adjusted for specific business scenarios. What matters is that this provides a relatively objective way to compare the translation quality of different large models, which can serve as a solid basis for business applications and technology selection.