Qwen2.5 Technical Report

In December 2024, the Qwen team (Tongyi Qianwen) released the paper “Qwen2.5 Technical Report”.

This report introduces Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen2.5 makes significant improvements in both the pre-training and post-training phases. For pre-training, the high-quality pre-training dataset has been expanded from 7 trillion to 18 trillion tokens, providing a solid foundation for common sense, expert knowledge, and reasoning abilities. For post-training, sophisticated supervised fine-tuning is performed on over 1 million samples, followed by multi-stage reinforcement learning that combines offline DPO and online GRPO. These post-training techniques substantially improve alignment with human preferences as well as long-text generation, structured data analysis, and instruction following.

To handle diverse and varied use cases effectively, the Qwen2.5 LLM series is offered in a rich set of configurations. The open-weight releases include base and instruction-tuned models at parameter sizes of 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B, along with quantized versions of the instruction-tuned models. In total, over 100 models are available on Hugging Face Hub, ModelScope, and Kaggle. Additionally, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants, Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio.

Qwen2.5 exhibits top-tier performance across a wide range of benchmark tests evaluating language understanding, reasoning, mathematics, coding, and human preference matching. Specifically, the flagship model Qwen2.5-72B-Instruct outperforms many open and proprietary models and is competitive with the state-of-the-art open-weight model Llama-3-405B-Instruct, which is approximately five times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer exceptional cost-effectiveness while remaining competitive with GPT-4o-mini and GPT-4o. Furthermore, Qwen2.5 serves as a foundation for training specialized models such as Qwen2.5-Math (Yang, 2024b), Qwen2.5-Coder (Hui, 2024), QwQ (Qwen Team, 2024d), and multimodal models.


With the rapid development of large foundation models, especially large language models (LLMs), the spark of artificial general intelligence (AGI) has become increasingly evident (Brown, 2020; OpenAI, 2023; 2024a; Gemini Team, 2024; Anthropic, 2023a; b; 2024; Bai, 2023; Yang, 2024a; Touvron, 2023a; b; Dubey, 2024). Continuous advances in model and data scaling, coupled with the paradigm of large-scale pre-training followed by high-quality supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) (Ouyang, 2022), enable LLMs to develop emergent capabilities in language understanding, generation, and reasoning. Building on this, recent breakthroughs in inference-time scaling, demonstrated in particular by o1 (OpenAI, 2024b), enhance LLMs’ ability to engage in deep thinking through iterative reasoning and reflection. These developments raise the potential of language models, indicating that they may achieve significant breakthroughs in scientific exploration as they continue to demonstrate emergent capabilities indicative of more general artificial intelligence.

In addition to the rapid advancement of model capabilities, the LLM community has witnessed an explosion of open (open-weight) large language models in recent years, such as the Llama series (Touvron, 2023a; b; Dubey, 2024), Mistral series (Jiang, 2023a; 2024), and Qwen series (Bai, 2023; Yang, 2024a; Qwen Team, 2024a; Hui, 2024; Qwen Team, 2024c; Yang, 2024b). Open-weight models democratize access to large language models, allowing ordinary users and developers to participate more widely in research, fostering innovation through community collaboration, and accelerating the development of AI applications across various fields.

Recently, details of the latest version of the Qwen series, Qwen2.5, were released. On the open-weight side, pre-trained and instruction-tuned models in seven sizes (0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B) were released, providing not only the original bfloat16-precision models but also quantized models at different precisions. The flagship model Qwen2.5-72B-Instruct demonstrates competitive performance against the state-of-the-art open-weight model Llama-3-405B-Instruct, which is about five times larger. Additionally, proprietary mixture-of-experts models (MoE, Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022), namely Qwen2.5-Turbo and Qwen2.5-Plus, have been released, competing with GPT-4o-mini and GPT-4o respectively.

Essentially, the Qwen2.5 series includes open-weight dense models, i.e., Qwen2.5-0.5B / 1.5B / 3B / 7B / 14B / 32B / 72B, as well as MoE models served via API, namely Qwen2.5-Turbo and Qwen2.5-Plus. Detailed information about the model architecture is provided below.

For the dense models, the Transformer-based decoder architecture (Vaswani, 2017; Radford, 2018) of Qwen2 (Yang, 2024a) is retained. This architecture includes several key components: Grouped Query Attention (GQA, Ainslie, 2023) for efficient use of the KV cache, the SwiGLU activation function (Dauphin, 2017), Rotary Positional Embeddings (RoPE, Su, 2024) for encoding positional information, QKV biases in the attention mechanism (Su, 2023), and RMSNorm (Jiang, 2023b) with pre-normalization for stable training.
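As a rough sketch of two of these components, the PyTorch code below implements RMSNorm and a grouped-query attention block with QKV biases. The head counts and dimensions are illustrative assumptions rather than Qwen2.5’s actual configuration, and the RoPE rotation is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Pre-normalization applied before attention and FFN sub-layers."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class GQAttention(nn.Module):
    """Grouped Query Attention: many query heads share a smaller set of KV heads,
    which shrinks the KV cache. Head counts here are assumptions, not Qwen2.5's."""
    def __init__(self, d_model=1024, n_heads=16, n_kv_heads=4):
        super().__init__()
        self.n_heads, self.n_kv, self.hd = n_heads, n_kv_heads, d_model // n_heads
        # QKV projections carry biases, matching the design described above
        self.wq = nn.Linear(d_model, n_heads * self.hd, bias=True)
        self.wk = nn.Linear(d_model, n_kv_heads * self.hd, bias=True)
        self.wv = nn.Linear(d_model, n_kv_heads * self.hd, bias=True)
        self.wo = nn.Linear(n_heads * self.hd, d_model, bias=False)

    def forward(self, x):                                  # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.hd).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.hd).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.hd).transpose(1, 2)
        # each group of query heads attends to one shared KV head
        k = k.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv, dim=1)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(o.transpose(1, 2).reshape(b, t, -1))
```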

The dense architecture is extended to the MoE architecture by replacing standard feed-forward network (FFN) layers with dedicated MoE layers, where each layer contains multiple FFN experts and a routing mechanism that dispatches tokens to the top-K experts. Following the approach demonstrated in Qwen1.5-MoE (Yang, 2024a), fine-grained expert partitioning (Dai, 2024) and shared expert routing (Rajbhandari, 2022; Dai, 2024) are implemented. These architectural innovations significantly improve model performance on downstream tasks.
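Under assumed sizes, such an MoE layer might look like the sketch below: a router selects the top-K experts for each token, fine-grained experts use a small intermediate width, and a handful of shared experts process every token. Expert counts and dimensions are illustrative, not the configuration of Qwen2.5-Turbo or Qwen2.5-Plus.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal sketch of a fine-grained MoE FFN with shared experts and top-k routing."""
    def __init__(self, d_model=1024, d_expert=256, n_experts=64, n_shared=4, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(self._ffn(d_model, d_expert) for _ in range(n_experts))
        self.shared = nn.ModuleList(self._ffn(d_model, d_expert) for _ in range(n_shared))

    @staticmethod
    def _ffn(d_model, d_hidden):
        # SiLU MLP stand-in for the gated SwiGLU FFN used in the dense models
        return nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(),
                             nn.Linear(d_hidden, d_model))

    def forward(self, x):                         # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        top_w, top_i = scores.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(-1, keepdim=True)   # renormalize over selected experts
        out = sum(e(x) for e in self.shared)          # shared experts see every token
        for k in range(self.top_k):
            for e_id in top_i[:, k].unique():
                mask = top_i[:, k] == e_id            # tokens routed to this expert at slot k
                out[mask] += top_w[mask, k].unsqueeze(-1) * self.experts[int(e_id)](x[mask])
        return out
```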

For tokenization, the Qwen tokenizer (Bai, 2023) is used, which implements byte-level byte pair encoding (BBPE, Brown, 2020; Wang, 2020; Sennrich, 2016) with a vocabulary containing 151,643 regular tokens. Compared to previous Qwen versions, the control token set has been expanded from 3 to 22 tokens, adding two new tokens for tool functionality and allocating the remainder to other model functionalities. This expansion establishes a unified vocabulary across all Qwen2.5 models, enhancing consistency and reducing potential compatibility issues.
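A quick way to inspect this shared vocabulary is through Hugging Face transformers; the checkpoint name below is an example and assumes the model is accessible.

```python
from transformers import AutoTokenizer

# Example checkpoint; any Qwen2.5 model should expose the same shared vocabulary.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
print(tok.vocab_size)    # regular byte-level BPE tokens (151,643 per the report)
print(len(tok))          # regular tokens plus the expanded set of control tokens
# The chat template makes some control tokens visible, e.g. <|im_start|> / <|im_end|>.
print(tok.apply_chat_template([{"role": "user", "content": "hello"}], tokenize=False))
```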

The pre-training process of the language model consists of several key parts. First, high-quality training data is meticulously selected through a sophisticated filtering and scoring mechanism, combined with strategic data mixing. Second, extensive research on hyperparameter optimization is conducted to effectively train models of various sizes. Finally, specialized long-context pre-training is incorporated to enhance the model’s ability to handle and understand extended sequences. Below, the methods for data preparation, hyperparameter selection, and long-context training are detailed.

Pre-training Data

Compared to its predecessor Qwen2, Qwen2.5 shows significant improvements in the quality of its pre-training data. These improvements stem from several key aspects:

  • (1) Better Data Filtering. High-quality pre-training data is crucial for model performance, so data quality evaluation and filtering become an important part of the process. Using the Qwen2-Instruct model as a data quality filter, comprehensive multidimensional analyses are performed to evaluate and score training samples. The filtering methods represent significant advancements over those previously used for Qwen2, as they benefit from the larger multilingual corpus on which Qwen2 was pre-trained. Enhanced capabilities allow for more nuanced quality assessments, thereby improving the retention rate of high-quality training data while more effectively filtering out low-quality samples across multiple languages.

  • (2) Better Math and Code Data. During the pre-training phase of Qwen2.5, training data from Qwen2.5-Math (Yang, 2024b) and Qwen2.5-Coder (Hui, 2024) is integrated. This data integration strategy has proven highly effective, as these specialized datasets are crucial for achieving state-of-the-art performance on math and coding tasks. By leveraging these high-quality domain-specific datasets during pre-training, Qwen2.5 inherits robust capabilities in mathematical reasoning and code generation.

  • (3) Better Synthetic Data. To generate high-quality synthetic data, particularly in mathematics, coding, and knowledge domains, Qwen2-72B-Instruct (Yang, 2024a) and Qwen2-Math-72B-Instruct (Qwen Team, 2024c) are utilized. Strict filtering is conducted using proprietary general reward models and specialized Qwen2-Math-RM-72B (Qwen Team, 2024c) models to further enhance the quality of synthetic data.

  • (4) Better Data Mixing. To optimize the distribution of pre-training data, the Qwen2-Instruct model is used to categorize and balance content from different domains. Analysis shows that fields such as e-commerce, social media, and entertainment are disproportionately overrepresented in web-scale data, often containing repetitive, template-based, or machine-generated content. In contrast, fields like technology, science, and academic research, while containing higher quality information, have traditionally been underrepresented. By strategically downsampling overrepresented domains and upsampling high-value domains, a more balanced and informative training dataset is ensured, better serving the model’s learning objectives.

Based on these techniques, a larger and higher-quality pre-training dataset has been developed, expanding from 7 trillion tokens used in Qwen2 (Yang et al., 2024a) to 18 trillion tokens.
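To make the domain rebalancing described in item (4) above concrete, here is a minimal resampling sketch; the domain weights are illustrative assumptions, not values from the report.

```python
import random

# Illustrative per-domain sampling weights -- assumptions, not values from the report.
# Weights below 1 downsample a domain; weights above 1 upsample it in expectation.
DOMAIN_WEIGHTS = {"ecommerce": 0.3, "social_media": 0.3, "entertainment": 0.4,
                  "technology": 2.0, "science": 2.5, "academic": 2.5}

def rebalance(docs, weights=DOMAIN_WEIGHTS, rng=random.Random(0)):
    """docs: iterable of dicts with a 'domain' label (e.g. assigned by a classifier)."""
    out = []
    for doc in docs:
        w = weights.get(doc["domain"], 1.0)
        copies = int(w) + (1 if rng.random() < w - int(w) else 0)
        out.extend([doc] * copies)   # 0 copies drops the doc, >1 duplicates it
    rng.shuffle(out)
    return out
```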

Scaling Laws for Hyperparameters

Hyperparameter scaling laws are formulated based on the pre-training data of Qwen2.5 (Hoffmann, 2022; Kaplan, 2020). While previous studies (Dubey, 2024; Almazrouei, 2023; Hoffmann, 2022) primarily used scaling laws to determine the optimal model size for a given compute budget, here they are used to determine optimal hyperparameters across model architectures. Specifically, the scaling laws help identify critical training parameters, such as the batch size B and learning rate μ, for dense and MoE models of different sizes.

Through extensive experiments, the relationship between model architecture and optimal training hyperparameters is systematically studied. Specifically, the analysis examines how the optimal learning rate μ_opt and batch size B_opt change with model size N and pre-training data size D. Experiments cover a wide range of architectures, including dense models with 44M to 14B parameters and MoE models with 44M to 1B active parameters, trained on datasets ranging from 0.8B to 600B tokens. Using these optimal hyperparameter predictions, the final loss is then modeled as a function of model architecture and training data scale.
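The report summary above does not spell out the fitted functional form, but a common assumption is a power law in N and D. The sketch below fits such a law to hypothetical sweep results by least squares in log space; all numbers are placeholders, not measurements from the paper.

```python
import numpy as np

def fit_power_law(N, D, mu_opt):
    """Fit mu_opt ~ c * N^a * D^b via least squares on the log-transformed data."""
    X = np.column_stack([np.ones_like(N, dtype=float), np.log(N), np.log(D)])
    coef, *_ = np.linalg.lstsq(X, np.log(mu_opt), rcond=None)
    log_c, a, b = coef
    return np.exp(log_c), a, b

# Placeholder sweep results -- NOT measurements from the report, illustration only.
N  = np.array([44e6, 110e6, 440e6, 1.6e9, 7e9])     # model parameters
D  = np.array([8e8, 8e9, 3e10, 1e11, 6e11])         # training tokens
mu = np.array([3e-3, 1.5e-3, 8e-4, 4e-4, 2e-4])     # best learning rate per sweep
c, a, b = fit_power_law(N, D, mu)                   # predicts mu_opt for unseen (N, D)
```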

Additionally, scaling laws are used to predict and compare the performance of MoE models with different parameter counts against their dense model counterparts. This analysis guides the hyperparameter configuration of MoE models, allowing for performance equivalence with specific dense model variants (e.g., Qwen2.5-72B and Qwen2.5-14B) through careful tuning of active parameters and total parameters.

Long Context Pre-training

To achieve optimal training efficiency, Qwen2.5 adopts a two-stage pre-training approach: an initial stage with a context length of 4,096 tokens, followed by an extended stage for longer sequences. Following the strategy used in Qwen2, the context length of all model variants except Qwen2.5-Turbo is extended from 4,096 to 32,768 tokens in the final pre-training stage. At the same time, the base frequency of RoPE is increased from 10,000 to 1,000,000 using the ABF technique (Xiong, 2023).
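The effect of the ABF change can be seen by computing the rotary frequencies directly: raising the base stretches the longest wavelength by roughly two orders of magnitude. The head dimension below is an assumed example.

```python
import numpy as np

def rope_inv_freq(head_dim=128, base=10_000.0):
    """Per-dimension rotary frequencies: inv_freq[i] = base^(-2i/head_dim)."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

low  = rope_inv_freq(base=10_000.0)      # original base
high = rope_inv_freq(base=1_000_000.0)   # base after ABF
# The longest wavelength (2*pi divided by the smallest frequency) grows by roughly
# two orders of magnitude, keeping positional phases distinguishable over long contexts.
print(2 * np.pi / low[-1], 2 * np.pi / high[-1])
```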

For Qwen2.5-Turbo, a progressive context length extension strategy is implemented during training, divided into four stages: 32,768 tokens, 65,536 tokens, 131,072 tokens, and ultimately 262,144 tokens, with RoPE base frequency set at 10,000,000. At each stage, training data is carefully selected to include 40% of the current maximum length sequences and 60% of shorter sequences. This progressive training approach allows for a smooth adaptation to increasing context lengths while maintaining the model’s ability to effectively handle and generalize across different length sequences.
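A minimal sketch of this per-stage 40/60 length mix might look as follows, assuming the corpus is a list of token-id sequences that contains documents at each stage's maximum length.

```python
import random

def sample_stage_batch(corpus, stage_max_len, n=1024, rng=random.Random(0)):
    """Roughly 40% sequences at the stage's maximum length, 60% shorter ones.
    Truncating long documents to the stage length is a simplification of this sketch."""
    long_docs  = [d for d in corpus if len(d) >= stage_max_len]
    short_docs = [d for d in corpus if len(d) < stage_max_len]
    batch  = [rng.choice(long_docs)[:stage_max_len] for _ in range(int(0.4 * n))]
    batch += [rng.choice(short_docs) for _ in range(n - len(batch))]
    rng.shuffle(batch)
    return batch

STAGE_LENGTHS = (32_768, 65_536, 131_072, 262_144)   # Qwen2.5-Turbo's four stages
```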

To enhance the models’ ability to handle longer sequences at inference time, two key strategies are implemented: Yet another RoPE extensioN (YARN, Peng, 2023) and Dual Chunk Attention (DCA, An, 2024). Through these innovations, the sequence length capacity is quadrupled, enabling Qwen2.5-Turbo to handle up to 1 million tokens while the other models handle up to 131,072 tokens. Notably, these methods improve long-sequence modeling by reducing perplexity while maintaining strong performance on shorter sequences, ensuring consistent quality across varying input lengths.
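At inference time, this kind of 4× extrapolation is commonly expressed as a YaRN-style rope_scaling configuration. The key names below follow the convention used for Qwen models in Hugging Face Transformers and should be treated as an assumption of this sketch rather than text from the report.

```python
# Illustrative 4x extrapolation config: 4 x 32,768 = 131,072 tokens at inference.
# Key names are an assumption of this sketch, not quoted from the report.
rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32_768,
}
```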

Compared to Qwen2, Qwen2.5 introduces two significant improvements in post-training design:

(1) Expanded Coverage of Supervised Fine-tuning Data: The supervised fine-tuning process utilizes a massive dataset containing millions of high-quality examples. This expansion specifically addresses key areas where previous models exhibited limitations, such as long sequence generation, mathematical problem-solving, coding, instruction adherence, structured data understanding, logical reasoning, cross-lingual transfer, and robust system instructions.

(2) Two-stage Reinforcement Learning: The reinforcement learning (RL) process in Qwen2.5 is divided into two distinct stages: offline RL and online RL.

  • Offline RL: This stage focuses on developing capabilities that are difficult to evaluate using reward models, such as reasoning, factuality, and instruction adherence. By carefully constructing and validating training data, offline RL signals are ensured to be both learnable and reliable (Xiang, 2024), enabling the model to effectively acquire these complex skills.

  • Online RL: The online RL stage leverages the reward model’s ability to detect subtle differences in output quality, including factual accuracy, usefulness, conciseness, relevance, harmlessness, and debiasing. It enables the model to generate precise, coherent, and well-structured responses while maintaining safety and readability. Thus, the model’s outputs consistently meet human quality standards and expectations.

Supervised Fine-tuning

The key improvements Qwen2.5 makes during the SFT stage span several areas:

(1) Long Sequence Generation: Qwen2.5 is capable of generating high-quality content with output lengths of up to 8,192 tokens, a significant advance over typical post-training response lengths, which usually remain below 2,000 tokens. To bridge this gap, a long-response dataset is developed (Quan et al., 2024): long-output queries are generated via back-translation from the pre-training corpus, output length constraints are imposed, and Qwen2 is used to filter out low-quality paired data.

(2) Mathematics: Chain-of-thought data from Qwen2.5-Math (Yang et al., 2024b) is introduced, encompassing various query sources, including public datasets, K-12 problem sets, and comprehensive problems. To ensure high-quality reasoning, rejection sampling (Yuan, 2023) is employed, along with reward models and annotated answers as guidance, to produce step-by-step reasoning processes.

(3) Coding: To enhance coding capabilities, instruction-tuning data from Qwen2.5-Coder (Hui, 2024) is integrated. Several language-specific agents are incorporated into a collaborative framework, generating diverse and high-quality instruction pairs across nearly 40 programming languages. The instruction dataset is expanded by synthesizing new examples from code-related Q&A websites and collecting algorithm code snippets from GitHub. A comprehensive multilingual sandbox is used to run static code checks and validate code snippets through automated unit tests, ensuring code quality and correctness (Dou, 2024; Yang, 2024c).

(4) Instruction Following: To ensure high-quality instruction-following data, a rigorous code verification framework is implemented. In this approach, the LLM generates instructions and corresponding verification code, along with comprehensive unit tests for cross-validation. Through rejection sampling based on execution feedback, training data is carefully selected for supervised fine-tuning, ensuring that the model faithfully adheres to the intended instructions (Dong, 2024); a minimal sketch of this execution-feedback filter appears after this list.

(5) Structured Data Understanding: A comprehensive structured understanding dataset is developed, covering traditional tasks such as table question answering, factual verification, error correction, and structural understanding, as well as complex tasks involving structured and semi-structured data. By incorporating reasoning chains into the model’s responses, its ability to infer information from structured data is significantly enhanced, improving its performance across these different tasks. This approach not only broadens the dataset’s scope but also deepens the model’s ability to reason and draw meaningful insights from complex data structures.

(6) Logical Reasoning: To enhance the model’s logical reasoning capabilities, a new set of 70,000 queries covering various domains is introduced. These queries include multiple-choice questions, judgment questions, and open-ended problems. The model is trained to systematically solve problems using a range of reasoning methods, including deductive reasoning, inductive generalization, analogical reasoning, causal reasoning, and statistical reasoning. Through iterative improvements, the process systematically filters out data containing incorrect answers or flawed reasoning processes. This process gradually enhances the model’s logical reasoning capabilities and accuracy, ensuring robust performance across different types of reasoning tasks.

(7) Cross-lingual Transfer: To facilitate the model’s general capabilities in cross-lingual transfer, instructions are translated from high-resource languages to various low-resource languages using translation models, generating corresponding response candidates. To ensure the accuracy and consistency of these responses, each multilingual response is evaluated for semantic alignment with its original response. This process preserves the logical structure and stylistic nuances of the original responses, maintaining their integrity and coherence across different languages.

(8) Robust System Instructions: Hundreds of general system prompts are constructed to enhance the diversity of post-training system prompts, ensuring consistency between system prompts and dialogues. Evaluations of different system prompts indicate that the model maintains good performance (Lu et al., 2024b) and reduces variance, signifying improved robustness.

(9) Response Filtering: To assess the quality of responses, various automatic annotation methods are employed, including dedicated critic models and multi-agent collaborative scoring systems. Responses undergo rigorous evaluation, with only those deemed flawless by all scoring systems retained. This comprehensive approach ensures that outputs maintain the highest quality standards.
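As referenced in item (4), a minimal Python stand-in for the execution-feedback filter used for coding and instruction-following data could look like the following; the real pipeline relies on a multilingual sandbox and static analysis that are not shown here.

```python
import subprocess
import sys
import tempfile

def passes_unit_tests(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run a candidate snippet together with its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def rejection_sample(candidates, test_code):
    """Keep only responses whose tests pass; survivors become SFT training examples."""
    return [c for c in candidates if passes_unit_tests(c, test_code)]
```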

Ultimately, a dataset containing over 1 million SFT examples is constructed. The model is fine-tuned for two epochs with a sequence length of 32,768 tokens. To optimize learning, the learning rate gradually decreases from 7 × 10⁻⁶ to 7 × 10⁻⁷. To mitigate overfitting, a weight decay of 0.1 is applied and the gradient norm is clipped at a maximum of 1.0.
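Those hyperparameters translate into a training loop roughly like the sketch below. The model, step count, and decay shape are placeholders, since the report does not specify them.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the LLM; only the optimizer and schedule
# settings mirror the numbers quoted above. Step count and decay shape are
# assumptions -- the report does not specify them.
model = nn.Linear(16, 16)
total_steps = 1_000
opt = torch.optim.AdamW(model.parameters(), lr=7e-6, weight_decay=0.1)
# Decay the learning rate from 7e-6 down to 7e-7 over training (linear sketch).
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda step: 1.0 - 0.9 * min(step, total_steps) / total_steps)

for step in range(total_steps):
    loss = model(torch.randn(8, 16)).pow(2).mean()                    # dummy loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient-norm cap of 1.0
    opt.step()
    sched.step()
    opt.zero_grad()
```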

Offline Reinforcement Learning

Compared to online reinforcement learning (RL), offline RL allows training signals to be prepared in advance, which is particularly beneficial for tasks that have standard answers but are challenging to evaluate with reward models. In this study, the focus is on objective query domains such as mathematics, coding, instruction adherence, and logical reasoning, where obtaining accurate evaluations can be complex. In the previous stage, strategies such as execution feedback and answer matching were widely adopted to ensure the quality of responses. For the current stage, this pipeline is reused, with the SFT model resampling responses to a new set of queries. Responses that pass quality checks are used as positive examples, while those that do not are treated as negative examples for direct preference optimization (DPO) training (Rafailov, 2023). To further enhance the reliability and accuracy of training signals, both manual and automated review processes are employed (Cao, 2024). This dual approach ensures that the training data is not only learnable but also aligned with human expectations. Ultimately, a dataset consisting of approximately 150,000 training pairs is constructed. The model is then trained for one epoch using the online merging optimizer (Lu et al., 2024a) with a learning rate of 7 × 10⁻⁷.
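For reference, the standard DPO objective used in this stage takes the form below; the β value is an illustrative default, not a number from the report.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO objective (Rafailov, 2023). Inputs are summed log-probabilities of the
    chosen / rejected responses under the policy and the frozen reference model.
    beta=0.1 is an illustrative default, not a value stated in the report."""
    margins = beta * ((policy_chosen_logp - ref_chosen_logp)
                      - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(margins).mean()
```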

Online Reinforcement Learning

To develop a robust reward model for online reinforcement learning, a set of carefully defined labeling criteria is adhered to. These criteria ensure that the responses generated by the model are not only of high quality but also adhere to ethical and user-centered standards (Wang et al., 2024a). The specific criteria for data labeling are as follows:

• Authenticity: Responses must be factually accurate and faithfully reflect the provided context and instructions. The model should avoid generating false or unsupported information.

• Usefulness: The model’s outputs should be genuinely useful, effectively addressing user queries while providing positive, engaging, educational, and relevant content. It should strictly adhere to the given instructions and provide value to the user.

• Conciseness: Responses should be concise and clear, avoiding unnecessary verbosity. The goal is to communicate information clearly and effectively without overwhelming the user with excessive details.

• Relevance: All parts of the response should be directly related to the user’s query, dialogue history, and the assistant’s context. The model should tailor its output to ensure that it fully meets the user’s needs and expectations.

• Harmlessness: The model must prioritize user safety, avoiding any content that could lead to illegal, unethical, or harmful behavior. It should consistently advocate for ethical behavior and responsible communication.

• Bias Elimination: The model should produce responses free from biases, including but not limited to gender, race, nationality, and politics. It should treat all subjects fairly and equitably, adhering to broadly accepted ethical and moral standards.

The queries used to train the reward model come from two different datasets: publicly available open-source data and a proprietary query set with higher complexity. Responses are generated from checkpoints of the Qwen model, which have been fine-tuned at various stages of training using different methods (SFT, DPO, and RL). To introduce diversity, these responses are sampled at different temperature settings. Preference pairs are created through both manual and automated labeling processes, and DPO training data is also integrated into this dataset.

In the online reinforcement learning (RL) framework, Group Relative Policy Optimization (GRPO, Shao, 2024) is adopted. The query set used to train the reward model is the same as that used during the RL training phase. The processing order of queries during training is determined by the variance of their response scores under the reward model: queries with higher score variance are prioritized to ensure more effective learning. Eight responses are sampled for each query. All models are trained with a global batch size of 2048 and 2048 samples per episode, where a query-response pair counts as one sample.
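The core idea of GRPO is that advantages are computed relative to the group of responses sampled for the same query, rather than from a learned value function. A minimal sketch with made-up reward scores:

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages (GRPO, Shao, 2024): each query's sampled responses
    are scored by the reward model and normalized within their own group, replacing
    a learned value function. rewards: (num_queries, group_size) reward-model scores."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Two queries x eight sampled responses each (scores are made-up placeholders).
scores = torch.tensor([[0.1, 0.4, 0.3, 0.9, 0.2, 0.5, 0.6, 0.0],
                       [0.7, 0.7, 0.8, 0.2, 0.9, 0.6, 0.5, 0.4]])
advantages = grpo_advantages(scores)   # weights the policy-gradient update per response
```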

Long Context Fine-tuning

To further extend the context length of Qwen2.5-Turbo, longer SFT examples are introduced during post-training, allowing better alignment with human preferences for long queries.

During the SFT stage, a two-stage approach is adopted. In the first stage, the model is fine-tuned using only short instructions, with each instruction containing a maximum of 32,768 tokens. This stage uses the same data and training steps as other Qwen2.5 models to ensure excellent performance on short tasks. In the second stage, the fine-tuning process combines short instructions (up to 32,768 tokens) and long instructions (up to 262,144 tokens). This mixed approach effectively enhances the model’s instruction adherence capabilities in long-context tasks while maintaining its performance on short tasks.

In the RL stage, similar training strategies are employed as with other Qwen2.5 models, focusing solely on short instructions. This design choice is primarily motivated by two considerations: first, reinforcement learning training is computationally expensive for long-context tasks; second, there is currently a lack of reward models capable of providing suitable reward signals for long-context tasks. Moreover, adopting reinforcement learning solely on short instructions can still significantly enhance the model’s consistency with human preferences in long-context tasks.
