Interpretation of Qwen2.5 Technical Report

MLNLP community is a well-known machine learning and natural language processing community both domestically and internationally, covering NLP master’s and doctoral students, university professors, and corporate researchers.

The vision of the community is to promote communication and progress between the academic and industrial circles of natural language processing and machine learning, especially for the advancement of beginners.

Reprinted from | Baobao Algorithm Notes

Interpretation of Qwen2.5 Technical Report

Recently, the Qwen team released the technical report titled “Qwen2.5 Technical Report”. Below is a brief summary of the report to help everyone gain a quick understanding.

Link: https://arxiv.org/pdf/2412.15115

Abstract

Interpretation of Qwen2.5 Technical Report

Qwen2.5 is a series of large language models (LLMs) designed to meet diverse needs. Compared to previous versions, Qwen 2.5 has made significant improvements in both pre-training and post-training phases. The pre-training dataset has been expanded from 70 trillion tokens to 180 trillion tokens, providing the model with a solid foundation of common sense, expert knowledge, and reasoning capabilities. The post-training phase includes complex supervised fine-tuning with over 1 million samples and multi-stage reinforcement learning, significantly enhancing the model’s alignment with human preferences, long text generation, structured data analysis, and instruction-following capabilities.

Features of the Qwen2.5 Series

Diverse configurations: Offers different sizes of base models and instruction-tuned models ranging from 0.5B to 72B parameters, as well as quantized versions.
Performance: Demonstrates outstanding performance in multiple benchmark tests, especially in language understanding, reasoning, mathematics, coding, and alignment with human preferences.
Model scale: Qwen2.5-72B-Instruct competes in performance with Llama-3-405B-Instruct, which is five times larger.

Architecture and Tokenizer

Interpretation of Qwen2.5 Technical Report

The Qwen2.5 series includes dense models based on Transformer and MoE (Mixture of Experts) models for API services. The model architecture includes grouped query attention, SwiGLU activation functions, and rotary position embeddings. The tokenizer uses byte-level Byte Pair Encoding (BBPE) with a vocabulary of 151,643 regular tokens.

Pre-training

The quality of the pre-training data has been significantly improved, including better data filtering, integration of mathematical and code data, generation of synthetic data, and data mixing (mainly through data manipulation). The pre-training data has increased from 70 trillion tokens to 180 trillion tokens.

Long context pre-training has been enhanced, with RoPE’s base increasing from 10,000 to 1,000,000, and context length expanding from 4,096 to 32,768. To support long-context performance, YARN (Peng et al., 2023) and Dual Chunk Attention (DCA, An et al., 2024) techniques have been used, enabling Qwen2.5-Turbo to handle 1 million tokens while other models can manage up to 131,072 tokens.

Post-training

Qwen 2.5 introduces two significant advancements in post-training design: expanded coverage of supervised fine-tuning data (mainly through data manipulation) and two-stage reinforcement learning (first offline DPO followed by online GRPO).

Evaluation

The Qwen2.5 series models have been evaluated on multiple benchmark tests, including natural language understanding, programming, mathematics, and multilingual capabilities. Qwen2.5-72B and Qwen2.5-Plus excelled in various tasks, competing with leading open-weight models.

Base Model

Interpretation of Qwen2.5 Technical Report

Instruct Model

Interpretation of Qwen2.5 Technical Report

On Our In-house Chinese Automatic Evaluation

Long Text

Interpretation of Qwen2.5 Technical Report

Conclusion

Qwen2.5 represents a significant advancement in large language models, offering various configurations and excelling in multiple benchmark tests. The powerful performance, flexible architecture, and broad availability of Qwen2.5 make it a valuable resource for academic research and industrial applications.

Interpretation of Qwen2.5 Technical Report

Abstract

Features of the Qwen2.5 Series

Architecture and Tokenizer

Pre-training

Post-training

Evaluation

Base Model

Instruct Model

On Our In-house Chinese Automatic Evaluation

Long Text

Conclusion

About Us

Leave a Comment Cancel reply