Qwen2.5 Technical Report Analysis: 18 Trillion Token Training

Introduction

The development of large language models (LLMs) is advancing rapidly, with each significant update potentially bringing substantial performance improvements and expanding application scenarios. Against this backdrop, Alibaba’s latest release of the Qwen2.5 series models has garnered widespread attention. This technical report provides a detailed overview of the development process, innovations, and performance of Qwen2.5, showcasing its latest advancements in natural language processing.

The core innovations of the Qwen2.5 series models are primarily reflected in two aspects: pre-training and post-training. During the pre-training phase, the research team expanded the training data scale from 7 trillion tokens to 18 trillion tokens, establishing a solid foundation for the model’s knowledge acquisition and understanding capabilities. In the post-training phase, researchers employed sophisticated techniques, including supervised fine-tuning (SFT) on 1 million samples and staged reinforcement learning (offline DPO followed by online GRPO), which significantly improved the model’s alignment with human preferences and enhanced its capabilities in long text generation and structured data analysis.

This article will delve into the development process of the Qwen2.5 model, including its innovative methods during pre-training and post-training phases, as well as its performance across various benchmark tests. Through this report, we can glimpse the forefront of current large language model technology development and understand how Qwen2.5 stands out among numerous competitors, becoming an important force in advancing natural language processing technology.

The article is structured as follows: innovations in the pre-training phase, optimizations in the post-training phase, performance evaluation of Qwen2.5, and a conclusion.

Innovations in the Pre-training Phase


Breakthroughs in Data Processing

Qwen2.5 has several significant innovations in the processing of pre-training data, greatly enhancing the quality and diversity of the training data.

Intelligent Data Filtering

The research team used the previous-generation Qwen2 models to filter the pre-training data. This method not only improved data quality but also enhanced the handling of multilingual data. Through this self-iterative approach, Qwen2.5 can better identify and retain high-quality training samples while effectively filtering out low-quality ones.
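
To make this concrete, here is a minimal sketch of model-based quality filtering. The report does not publish the actual prompt, scoring scale, or threshold, so everything below (the `FILTER_PROMPT`, the `llm.generate` interface, the cutoff of 3) is an illustrative assumption.

```python
# Minimal sketch of model-based pre-training data filtering.
FILTER_PROMPT = (
    "Rate the following document for training-data quality on a scale "
    "of 0-5, considering fluency, factuality, and educational value. "
    "Reply with a single integer.\n\nDocument:\n{doc}"
)

def score_document(llm, doc: str) -> int:
    """Ask a filter model (e.g., a Qwen2-Instruct checkpoint) for a quality score."""
    reply = llm.generate(FILTER_PROMPT.format(doc=doc[:4000]))
    try:
        return int(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0  # unparseable replies are treated as low quality

def filter_corpus(llm, docs, threshold: int = 3):
    """Keep only documents the filter model rates at or above the threshold."""
    return [d for d in docs if score_document(llm, d) >= threshold]
```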

Incorporation of Domain-Specific Data

Another highlight of Qwen2.5 is the incorporation of high-quality samples from Qwen2.5-Math and Qwen2.5-Coder. These samples cover mathematics and programming, significantly strengthening the model’s capabilities in these two key areas. This specialized data allows Qwen2.5 to excel at mathematical problems and programming tasks.

High-Quality Synthetic Data

The research team also generated high-quality synthetic data using the Qwen2-72B and Qwen2-Math models. Notably, they then screened this synthetic data with the Qwen2-Math-RM reward model to ensure its quality and relevance. This approach expanded the scale of the training data while maintaining high quality and diversity.
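
A hedged sketch of what such reward-model screening might look like follows; the `generator` and `reward_model.score` interfaces and the keep-best-candidate rule are illustrative assumptions, not the report's actual pipeline.

```python
# Illustrative sketch of reward-model screening for synthetic math data.
def screen_synthetic(problem: str, generator, reward_model, n_samples: int = 8):
    """Generate several candidate solutions and keep the highest-scoring one."""
    candidates = [generator.generate(problem) for _ in range(n_samples)]
    scored = [(reward_model.score(problem, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda x: x[0])
    # Discard the problem entirely if even the best candidate scores poorly.
    return best if best_score > 0.0 else None
```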

Intelligent Data Mixing

To balance different types of data, researchers used the Qwen2 model to classify the data and then balanced the processing of different categories of data. This method ensures that the model can learn from various types of data, avoiding bias caused by an overabundance of data from certain fields.
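
The report does not give the exact mixing recipe, but the general pattern resembles the following toy sketch, where the domain labels in `TARGET_MIX` and the `classify` function (e.g., a Qwen2-based labeler) are invented placeholders.

```python
# Toy sketch of mixture rebalancing: classify each document into a domain,
# then sample domains toward target proportions.
import random
from collections import defaultdict

TARGET_MIX = {"web": 0.5, "code": 0.15, "math": 0.15, "academic": 0.2}

def rebalance(docs, classify, total: int):
    """Draw `total` documents so domain shares match TARGET_MIX."""
    buckets = defaultdict(list)
    for d in docs:
        buckets[classify(d)].append(d)
    sample = []
    for domain, frac in TARGET_MIX.items():
        k = int(total * frac)
        pool = buckets.get(domain, [])
        if pool:
            # Oversample with replacement when a domain is underrepresented.
            sample += random.choices(pool, k=k) if k > len(pool) else random.sample(pool, k)
    random.shuffle(sample)
    return sample
```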

Breakthrough Scaling Laws

Another important innovation of Qwen2.5 lies in its application of scaling laws to hyperparameters. The research team conducted an in-depth study to identify the optimal learning rate and batch size as functions of model size (N) and data volume (D). This allows researchers to choose the best training parameters for models at each scale, balancing training efficiency against model performance.
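
As a rough illustration of how such hyperparameter scaling laws are typically fit, the sketch below regresses log learning rate on log N and log D. The power-law form is a common choice in the literature and an assumption here; the pilot-run numbers are made up, not the report's data.

```python
import numpy as np

# Invented pilot-run grid: model sizes N (params), data sizes D (tokens),
# and the best learning rate found by sweeping at each point.
N = np.array([1e8, 5e8, 1e9, 7e9])
D = np.array([1e10, 5e10, 1e11, 1e12])
lr = np.array([1e-3, 6e-4, 4e-4, 1e-4])

# Fit log(lr_opt) = log(a) + alpha*log(N) + beta*log(D) by least squares,
# i.e. the power law lr_opt = a * N^alpha * D^beta.
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D)])
coef, *_ = np.linalg.lstsq(X, np.log(lr), rcond=None)
log_a, alpha, beta = coef

def lr_opt(n_params: float, d_tokens: float) -> float:
    """Predicted optimal learning rate at a new (model size, data size) point."""
    return float(np.exp(log_a + alpha * np.log(n_params) + beta * np.log(d_tokens)))

print(f"predicted LR for 72B params on 18T tokens: {lr_opt(72e9, 18e12):.2e}")
```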

Innovations in Long Context Processing

Qwen2.5 has also made significant breakthroughs in handling long contexts:

  1. Multi-Stage Training: The model training is divided into two stages, first training at a context length of 4K, then expanding to 32K. This progressive approach allows the model to gradually adapt to longer contexts.

  2. Adjustment of RoPE Base Values: Raising the base frequency of RoPE via the ABF (adjusted base frequency) technique further enhances the model’s ability to handle long sequences (see the sketch after this list).

  3. Innovations in Qwen2.5-Turbo: This special version adopts a four-stage training strategy (4K, 32K, 64K, 128K), ensuring that 40% of the data reaches maximum length while 60% consists of shorter sequences. This balanced approach allows the model to perform excellently across various input lengths.

  4. Optimization of the Inference Phase: The introduction of YaRN and Dual Chunk Attention (DCA) further enhances the model’s ability to handle long sequences in practical applications.
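
Item 2 above can be illustrated with a short sketch of RoPE frequencies under an adjusted base. Raising the base slows each dimension's rotation so positional phases remain distinguishable at longer distances; the base values 10,000 and 1,000,000 below are common choices in the long-context literature and assumptions here, not figures quoted from this report.

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float) -> torch.Tensor:
    """Rotation angles [seq_len, head_dim/2] for rotary position embeddings."""
    # A larger base lowers the rotation frequency of every dimension, so
    # relative phases stay distinguishable over much longer distances.
    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, freqs)

short_ctx = rope_angles(4096, 128, base=10_000.0)     # standard pre-training base
long_ctx = rope_angles(32768, 128, base=1_000_000.0)  # raised base for long context
```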

These innovations enable Qwen2.5 to excel in handling long texts and complex contexts, significantly expanding the model’s application scenarios.

Optimizations in the Post-Training Phase

The post-training phase is a key stage in enhancing the performance of the Qwen2.5 model. The research team employed a series of innovative methods here, including sophisticated supervised fine-tuning (SFT) and multi-stage reinforcement learning.

Innovations in Supervised Fine-Tuning (SFT)

Enhancement of Long Sequence Generation Capability

Qwen2.5 increases the maximum output length to 8K tokens. The research team generated long-output queries from the pre-training corpus and added length-control instructions, enabling the model to produce longer and more coherent texts. This greatly improves the model’s performance on long-text generation tasks.

Enhancement of Mathematical Abilities

The researchers utilized chain-of-thought (CoT) data from Qwen2.5-Math and used rejection sampling to generate step-by-step reasoning processes. This method not only improves the model’s mathematical reasoning but also lets it clearly demonstrate its problem-solving steps.
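
A minimal sketch of this rejection-sampling loop, assuming placeholder `generator.generate` and `extract_answer` helpers: sample several reasoning traces per problem and keep only those whose final answer matches the reference.

```python
# Sketch of rejection sampling for chain-of-thought data.
def sample_cot(problem: str, reference: str, generator, extract_answer, n: int = 8):
    """Return reasoning traces that end in the correct answer."""
    kept = []
    for _ in range(n):
        trace = generator.generate(problem, temperature=0.7)
        if extract_answer(trace) == reference:
            kept.append(trace)  # correct final answer -> usable SFT sample
    return kept
```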

Enhancement of Programming Abilities

Qwen2.5 integrates data from Qwen2.5 Coder, covering nearly 40 programming languages. This extensive language coverage allows the model to excel in various programming tasks, significantly enhancing its application potential in software development.

Optimization of Instruction Following Capabilities

The research team implemented an innovative code-verification framework in which the LLM generates not only instructions but also the corresponding verification code. Rejection sampling based on execution feedback then ensures that the model can accurately understand and execute a wide variety of instructions.
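
The sketch below illustrates the execution-feedback idea: a checker function (hand-written here, standing in for LLM-generated verification code) is executed against each response, and only passing responses are kept. In a real pipeline the `exec` call would be sandboxed.

```python
# Sketch of execution-feedback verification for instruction-following data.
CHECKER_SRC = '''
def check(response: str) -> bool:
    # Constraint: answer in exactly three bullet points starting with "-".
    lines = [l for l in response.splitlines() if l.strip()]
    return len(lines) == 3 and all(l.lstrip().startswith("-") for l in lines)
'''

def passes_verification(response: str, checker_src: str = CHECKER_SRC) -> bool:
    """Execute the generated checker and apply it to the model response."""
    namespace = {}
    exec(checker_src, namespace)  # sandbox this in production
    return bool(namespace["check"](response))

print(passes_verification("- a\n- b\n- c"))       # True
print(passes_verification("just a paragraph"))    # False
```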

Enhancement of Structured Data Understanding

By constructing a comprehensive structured data understanding dataset and introducing reasoning chains, Qwen2.5 significantly improves its ability to infer and extract information from complex data structures. This improvement makes the model more adept at handling structured data such as tables and JSON.

Improvement of Logical Reasoning Abilities

The researchers constructed a diversified dataset containing 70,000 entries, covering multiple-choice, true/false, and open-ended questions. These questions systematically include various reasoning methods such as deductive reasoning, inductive generalization, analogical reasoning, causal reasoning, and statistical reasoning, comprehensively enhancing the model’s logical thinking capabilities.

Enhancement of Multilingual Capabilities

The research team utilized translation models to translate instructions from high-resource languages to low-resource languages and assessed the semantic consistency between multilingual responses and original responses. This method not only expands the model’s language coverage but also ensures consistent performance across languages.
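
A hedged sketch of such a consistency check, assuming hypothetical `translate` and `embed` helpers and an illustrative similarity threshold: translate the target-language response back and compare its embedding with that of the original response.

```python
# Sketch of cross-lingual semantic-consistency filtering.
import numpy as np

def consistent(orig_response: str, tgt_response: str, tgt_lang: str,
               translate, embed, threshold: float = 0.85) -> bool:
    """Keep the pair only if the back-translation stays close to the original."""
    back = translate(tgt_response, source=tgt_lang, target="en")
    a, b = embed(orig_response), embed(back)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cos >= threshold
```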

Improvement of System Instruction Robustness

By constructing diverse system prompts and ensuring their consistency with dialogue content, Qwen2.5 performs more stably and reliably in handling different types of system instructions.

Strict Control of Response Quality

The researchers employed various automatic evaluation methods, including specialized discriminators and multi-agent scoring systems, retaining only samples deemed high-quality by all evaluation methods. This rigorous screening ensures the high quality of training data.

Innovations in Offline Reinforcement Learning (Offline RL)

To address the issue of inaccurate scoring by the reward model (RM) in certain tasks, the research team innovatively employed execution feedback and answer matching to construct positive and negative samples. This method is particularly suitable for objective task domains such as mathematics and programming. Additionally, the researchers introduced a dual manual checking mechanism to further ensure the reliability and accuracy of training signals.
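
The sketch below shows one way to build DPO preference pairs from execution feedback rather than reward-model scores, with `verify` standing in for answer matching or test execution; the pairing rule is an illustrative assumption, not the report's exact procedure.

```python
# Sketch of constructing DPO pairs from verifiable feedback.
def build_dpo_pairs(prompts, generator, verify, n: int = 6):
    """Pair one passing and one failing response per prompt, when both exist."""
    pairs = []
    for prompt in prompts:
        responses = [generator.generate(prompt) for _ in range(n)]
        passed = [r for r in responses if verify(prompt, r)]
        failed = [r for r in responses if not verify(prompt, r)]
        if passed and failed:
            pairs.append({"prompt": prompt, "chosen": passed[0], "rejected": failed[0]})
    return pairs
```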

Breakthroughs in Online Reinforcement Learning (Online RL)

Optimization of Reward Models

The research team established comprehensive evaluation criteria for the reward model, including authenticity, usefulness, conciseness, relevance, harmlessness, and de-biasing. These standards ensure the high quality and ethicality of the model’s outputs.

Diversity of Datasets

The RM and RL training utilized a combination of open-source datasets and more complex private datasets. Responses were sourced from Qwen models at different training stages and generated using different temperature coefficients, ensuring data diversity.

Innovative Training Framework

The researchers adopted the GRPO (Group Relative Policy Optimization) framework for online RL training, a novel reinforcement learning algorithm that can optimize model strategies more effectively.

Intelligent Sampling Strategies

During training, eight outputs were sampled for each query, balancing exploration and exploitation, which helps the model learn more diverse responses.
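
GRPO's core trick is easy to show: score a group of sampled responses (eight per query, per the report) with the reward model, then normalize each reward against the group's mean and standard deviation, removing the need for a separate value network. The reward values below are made up for illustration.

```python
# Sketch of GRPO's group-relative advantage computation.
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Advantage of each sample relative to its own group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rewards = np.array([0.1, 0.7, 0.4, 0.9, 0.2, 0.6, 0.3, 0.8])  # 8 samples/query
print(group_relative_advantages(rewards).round(2))
```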

Through these innovative post-training methods, Qwen2.5 not only excels in various tasks but also demonstrates strong instruction-following capabilities and alignment with human preferences. These improvements make Qwen2.5 a more intelligent, reliable, and user-friendly large language model.

Performance Evaluation of Qwen2.5

The Qwen2.5 series models underwent comprehensive and rigorous evaluations, covering open benchmark tests and internal professional assessments. These evaluations not only showcase the model’s outstanding performance across various tasks but also highlight its remarkable capabilities in long context processing.

Open Benchmark Tests

The research team used a series of widely recognized benchmark tests to evaluate the performance of Qwen2.5:

  1. General Capability Tests: Including MMLU-Pro, MMLU-redux, and LiveBench 0831, used to assess the model’s comprehensive understanding and reasoning abilities.

  2. Scientific and Mathematical Abilities: Using GPQA, GSM8K, and MATH tests to evaluate the model’s performance in scientific reasoning and mathematical problem-solving.

  3. Programming Abilities: Through HumanEval, MBPP, MultiPL-E, and LiveCodeBench tests to comprehensively assess the model’s code generation and understanding capabilities.

  4. Instruction Following Abilities: Using IFEval tests to assess the model’s understanding and execution of instructions.

  5. Alignment with Human Preferences: Using MT-Bench and Arena-Hard tests to evaluate the consistency of model outputs with human expectations.

The results show that Qwen2.5-72B-Instruct performed excellently in multiple benchmark tests, even surpassing some models with larger parameter counts, such as Llama-3.1-405B-Instruct. Particularly in tests like MMLU-redux, MATH, and MBPP, Qwen2.5 demonstrated a leading advantage.


Internal Professional Evaluations

In addition to open benchmark tests, the research team also conducted a series of internal professional evaluations:

  1. Multilingual Capability Evaluation: Using AMMLU, JMMLU, and other multilingual tests to assess the model’s performance in different language environments.

  2. Long Context Capability Evaluation: Using RULER, LV-Eval, and Longbench-Chat tests to evaluate the model’s ability to handle long texts. The results show that Qwen2.5-72B-Instruct performs excellently across various context lengths, outperforming many open-source and proprietary models.

  3. Reward Model Evaluation: Using Reward Bench, RMB, and other tests to evaluate the performance of the reward model used for reinforcement learning.


Breakthroughs in Long Context Processing

Qwen2.5 has achieved significant breakthroughs in long context processing:

  1. Qwen2.5-Turbo: Achieved 100% accuracy on a 1-million-token passkey retrieval task, demonstrating exceptional capabilities in handling ultra-long contexts.

  2. Inference Speed Optimization: By introducing sparse attention mechanisms, inference speed for long contexts has been significantly improved. For a 1-million-token sequence, the attention computation is reduced by a factor of about 12.5.

  3. Time to First Token (TTFT): Across different hardware configurations, Qwen2.5-Turbo’s TTFT is 3.2 to 4.3 times faster than with traditional methods.

These evaluation results fully demonstrate the outstanding performance of the Qwen2.5 series models across various tasks, especially in handling long contexts. Qwen2.5 not only excels in standard benchmark tests but also showcases strong capabilities in processing complex, long texts in practical applications, laying a solid foundation for its use in various real-world scenarios.


Conclusion

The release of the Qwen2.5 series models marks another significant advancement in large language model technology. Through comprehensive innovations in both the pre-training and post-training phases, Qwen2.5 has achieved remarkable breakthroughs in several areas:

  1. Leap in Pre-training Data Scale: Expanding the training data from 7 trillion tokens to 18 trillion tokens provides the model with a broader and deeper knowledge base.

  2. Innovations in Post-training Techniques: Utilizing supervised fine-tuning with 1 million samples and staged reinforcement learning significantly enhances the model’s instruction-following capabilities and alignment with human preferences.

  3. Breakthroughs in Long Context Processing: Particularly, the exceptional performance of Qwen2.5-Turbo in handling long texts of millions of tokens opens up new possibilities for long document processing and complex task resolution.

  4. Enhancement of Multilingual and Multidisciplinary Capabilities: Through diverse datasets and specialized training strategies, Qwen2.5 performs excellently in multilingual understanding, scientific computation, programming, and other fields.

  5. Diversity of Model Series: From 0.5B to 72B parameters in open-source models, as well as proprietary models like Qwen2.5-Turbo and Qwen2.5-Plus, providing flexible options for various application scenarios.

The success of Qwen2.5 is reflected not only in its strong performance but, more importantly, in the new direction it demonstrates for large language model development. Through innovative data processing, training strategies, and architectural design, Qwen2.5 matches or even exceeds larger models despite a relatively modest parameter count and compute budget. This efficient yet powerful design philosophy matters greatly for the popularization and application of AI technology.

Looking ahead, the research team plans to continue advancing in the following areas:

  1. Continuously optimize the base model and instruction fine-tuning model, improving the quality and diversity of data.
  2. Develop multimodal models that integrate text, visual, auditory, and other modalities.
  3. Enhance the model’s reasoning capabilities and explore more effective large-scale reasoning computation methods.

If interested, feel free to follow our WeChat public account to get in touch!
