Introduction
I must say, Qwen is genuinely impressive. Its foundational capabilities have firmly established it as the leader among open-source models, and it is not inferior to most closed-source models either. Many companies' foundation-model teams are probably already being questioned about whether their own foundation models are still worth building.
Qwen's open-source momentum is as fierce as ever, which further convinces me that the foundation-model war is over and that it is now time for applications to flourish. At this point, everyone should focus more on SFT and RLHF.
Model Architecture

The architecture remains consistent with Qwen 2:
- Attention mechanism: GQA (grouped-query attention)
- Activation function: SwiGLU
- Position encoding: RoPE
- QKV bias in the attention mechanism
- Normalization: RMSNorm
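For a concrete picture, here is a minimal PyTorch sketch of these components (GQA projections with QKV bias, RMSNorm, SwiGLU); only the layer layout is shown, and the head counts and dimensions are illustrative placeholders, not Qwen 2.5's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square instead of mean/variance (no bias term).
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(gate(x)) * up(x), then project back down.
        return self.down(F.silu(self.gate(x)) * self.up(x))

class GQAProjections(nn.Module):
    """Grouped-query attention layout: many query heads share fewer KV heads."""
    def __init__(self, dim=1024, n_heads=16, n_kv_heads=4, head_dim=64):
        super().__init__()
        # QKV projections carry a bias term, as noted in the list above.
        self.q_proj = nn.Linear(dim, n_heads * head_dim, bias=True)
        self.k_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=True)
        self.v_proj = nn.Linear(dim, n_kv_heads * head_dim, bias=True)
        self.o_proj = nn.Linear(n_heads * head_dim, dim, bias=False)
```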
Pretraining
Pretraining Data
- Better data filtering: Used the Qwen2-Instruct model as a data-quality filter, performing finer-grained, multidimensional assessments to retain high-quality training data and filter out low-quality samples across languages.
- Better mathematical and code data: Integrated training data from Qwen2.5-Math and Qwen2.5-Coder, which proved to be an effective data-integration strategy.
- Better synthetic data: Generated high-quality synthetic data in mathematics, code, and knowledge domains using Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct, followed by strict filtering with reward models and the dedicated Qwen2-Math-RM-72B model.
- Better data mixing: Employed Qwen2-Instruct to classify and balance content from different domains (see the sketch after this list). The findings were:
  - E-commerce, social media, entertainment, and similar domains were over-represented in the web-scale data and often contained repetitive, templated, or machine-generated content; this content was downsampled.
  - Technology, science, and academic research contain higher-quality information but were under-represented; this content was upsampled.
- The total scale of training data was expanded to 18T tokens.
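As a rough illustration of the mixing step, here is a minimal sketch of domain-balanced resampling. It assumes an LLM-based domain classifier (e.g. prompting Qwen2-Instruct); the keyword heuristic and the sampling weights below are illustrative stand-ins, not values from the report.

```python
import random

SAMPLING_WEIGHTS = {
    "ecommerce": 0.3,      # over-represented, often templated -> downsample
    "social_media": 0.3,
    "entertainment": 0.3,
    "technology": 2.0,     # higher quality but under-represented -> upsample
    "science": 2.0,
    "academic": 2.0,
    "other": 1.0,
}

def classify_domain(doc: str) -> str:
    # Stand-in for the LLM classifier; a trivial heuristic keeps the sketch runnable.
    lowered = doc.lower()
    if any(k in lowered for k in ("buy now", "discount", "free shipping")):
        return "ecommerce"
    if any(k in lowered for k in ("theorem", "experiment", "dataset")):
        return "science"
    return "other"

def resample(corpus):
    kept = []
    for doc in corpus:
        w = SAMPLING_WEIGHTS.get(classify_domain(doc), 1.0)
        if w >= 1.0:
            # Upsample: repeat floor(w) times, plus a probabilistic extra copy
            # for the fractional part.
            kept.extend([doc] * int(w))
            if random.random() < w - int(w):
                kept.append(doc)
        elif random.random() < w:
            # Downsample: keep with probability w.
            kept.append(doc)
    return kept
```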
Long Context Pretraining
- Two-stage pretraining: an initial pretraining stage at sequence length 4096, followed by an expansion stage to handle longer contexts (32768).
- Utilized ABF (adjusting the base frequency) to increase the RoPE base from 10000 to 1000000 (see the sketch after this list).
- For Qwen2.5-Turbo, a four-stage progressive context expansion strategy: 32768 -> 65536 -> 131072 -> 262144, with a RoPE base frequency of 10000000.
- Each stage used 40% sequences at the current maximum length and 60% shorter sequences.
- Enhanced the model's ability to handle longer sequences at inference time with YaRN and DCA (dual chunk attention). Together these methods achieve a fourfold increase in sequence length, and they improve long-text processing while maintaining performance on short texts.
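To make the ABF point concrete, the sketch below shows only what raising the RoPE base does to the rotation frequencies: with a larger base, the per-dimension frequencies shrink, so positional phases grow more slowly and longer contexts stay within a familiar range. It is not the full long-context recipe, and the head dimension is an illustrative assumption.

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    # Standard RoPE inverse frequencies: base^(-2i/d) for each even dimension index i.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

short_ctx = rope_inv_freq(head_dim=128, base=10_000)     # original base
long_ctx = rope_inv_freq(head_dim=128, base=1_000_000)   # after ABF

# Prints how many times more slowly the lowest-frequency dimension rotates
# after the base increase (close to the 100x change in base).
print(short_ctx[-1] / long_ctx[-1])
```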
Post-Pretraining
SFT
Constructed a dataset of over one million SFT examples, with targeted data-synthesis enhancements in key domains. Training parameters: sequence length 32768, 2 epochs, learning rate decayed from 7×10⁻⁶ to 7×10⁻⁷, weight decay 0.1, gradient clipping 1.0.
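The training setup above, collected into a plain config sketch; the shape of the learning-rate decay schedule is not stated in the report and is left as an assumption here.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    max_seq_len: int = 32_768
    num_epochs: int = 2
    lr_start: float = 7e-6     # learning rate decays from 7e-6 ...
    lr_end: float = 7e-7       # ... down to 7e-7 over training
    weight_decay: float = 0.1
    grad_clip_norm: float = 1.0
```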
- Long sequence generation: Used back-translation to generate queries from long-text data, imposed length constraints on the outputs, and filtered out low-quality paired data with Qwen2.
- Mathematics: Introduced the chain-of-thought data from Qwen2.5-Math. To ensure high quality, used rejection sampling guided by a reward model and annotated answers, generating the reasoning process step by step.
- Coding: Introduced data from Qwen2.5-Coder.
- Instruction following: To ensure high-quality instruction-following data, implemented a strict code-based validation framework in which LLMs generate instructions together with corresponding validation code and comprehensive unit tests for cross-validation.
- Structured data understanding: Developed a comprehensive structured-understanding dataset covering traditional tasks (such as table Q&A, fact verification, error correction, and structural understanding) as well as complex tasks involving structured and semi-structured data.
- Logical reasoning: Introduced 70,000 new queries across domains, including multiple-choice, true/false, and open-ended questions. The model was trained to address problems systematically with a range of reasoning methods, such as deductive reasoning, inductive generalization, analogical reasoning, causal reasoning, and statistical reasoning.
- Cross-lingual transfer: Used translation models to translate instructions from high-resource languages into various low-resource languages and generate candidate responses. To ensure accuracy and consistency, the semantic alignment between each multilingual response and its original counterpart was assessed.
- Robust system instructions: Constructed hundreds of general system prompts to increase system-prompt diversity in post-training, ensuring consistency between system prompts and dialogues.
- Response filtering: Employed multiple automatic scoring systems, retaining only those responses deemed flawless by all of them.
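As a concrete reading of the response-filtering step, here is a minimal sketch in which a pair survives only if every automatic scorer accepts the response; the scorer interface is hypothetical, standing in for reward models or rule-based checkers.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], bool]   # (query, response) -> passes?

def filter_responses(pairs: List[Tuple[str, str]],
                     scorers: List[Scorer]) -> List[Tuple[str, str]]:
    kept = []
    for query, response in pairs:
        # Retain the pair only when all scorers consider the response flawless.
        if all(scorer(query, response) for scorer in scorers):
            kept.append((query, response))
    return kept
```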
RLHF
Implemented a two-stage RLHF:
Offline RL: This phase focuses on areas the reward model finds difficult to evaluate, such as reasoning, mathematics, coding, and instruction following. Quality-checked responses were used as positive examples and those that failed the checks as negative examples. To further improve reliability and accuracy, both manual and automated review pipelines were employed. Ultimately, a dataset of 150,000 training pairs was constructed.
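One plausible way to assemble these preference pairs, sketched under the assumption that passed and failed responses to the same prompt are simply paired up (the report only says that checked responses serve as positives and failed ones as negatives; the pairing scheme and function names here are illustrative):

```python
def build_dpo_pairs(prompts, responses_per_prompt, passes_quality_check):
    pairs = []
    for prompt in prompts:
        candidates = responses_per_prompt[prompt]
        passed = [r for r in candidates if passes_quality_check(prompt, r)]
        failed = [r for r in candidates if not passes_quality_check(prompt, r)]
        # Pair each accepted response with a rejected one from the same prompt.
        for chosen, rejected in zip(passed, failed):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```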
Online RL: Mainly used the reward model to detect differences in response quality, with the following set of standards established to define the data:
- Truthfulness: Responses must be grounded in factual accuracy, faithfully reflecting the provided context and instructions. The model should avoid generating false or unfounded information.
- Helpfulness: The model's output should be genuinely helpful, effectively addressing user queries while providing positive, engaging, educational, and relevant content. It should follow the given instructions precisely and provide value to the user.
- Conciseness: Responses should be concise and clear, avoiding unnecessary verbosity. The goal is to convey information clearly and efficiently without overwhelming the user with excessive detail.
- Relevance: All parts of the response should be directly relevant to the user's query, the dialogue history, and the assistant's context. The model should adjust its output to match the user's needs and expectations.
- Harmlessness: The model must prioritize user safety, avoiding any content that could lead to illegal, unethical, or harmful behavior. It should always promote ethical conduct and responsible communication.
- Debiasing: The model should produce unbiased responses, including but not limited to biases related to gender, race, nationality, and politics. It should treat all subjects equally and fairly, adhering to widely accepted moral and ethical standards.
Reward Model: Prompts come from two sources, open-source data and high-complexity proprietary datasets, with responses generated from Qwen model checkpoints. Iterative experiments revealed that current reward-model evaluation benchmarks do not accurately predict the performance of the RL model trained under their guidance. In other words, a high score on an RM benchmark does not necessarily mean that the RL model trained with that RM will perform well.
Several points to note:
- Offline RL uses DPO for training; online RL uses GRPO.
- The reward model's query set is the same as that used in RL.
- In online RL, queries with significantly different response scores are prioritized to ensure effective learning.
- Eight responses are sampled for each query.
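A minimal sketch tying these notes together: sample eight responses per query, score them with the reward model, keep only queries whose scores spread widely, and compute GRPO-style group-normalized advantages. The spread threshold and function interfaces are illustrative assumptions, not values from the report.

```python
import numpy as np

NUM_SAMPLES = 8
MIN_SCORE_SPREAD = 0.5   # assumption: skip queries whose responses all score alike

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO normalizes each reward against its own group's mean and std.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def prepare_online_rl_batch(queries, sample_fn, reward_fn):
    batch = []
    for query in queries:
        responses = [sample_fn(query) for _ in range(NUM_SAMPLES)]
        rewards = np.array([reward_fn(query, r) for r in responses])
        if rewards.max() - rewards.min() < MIN_SCORE_SPREAD:
            continue  # responses too similar in quality -> little to learn from
        batch.append((query, responses, group_advantages(rewards)))
    return batch
```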
Long Context
- SFT stage: Used two-stage training (see the sketch after this list). The first stage fine-tuned only on short texts, with each instruction limited to 32768 tokens; the second stage combined short texts (up to 32768 tokens) and long texts (up to 262144 tokens). This effectively enhances instruction following in long-context tasks while maintaining performance on short tasks.
- RL stage: Focused solely on short instructions, mainly because: (1) RL training on long contexts is computationally expensive; (2) suitable reward-model signals for long-context tasks are lacking; (3) RL on short instructions still significantly improves performance on long-context tasks.
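A minimal sketch of the two-stage SFT data split described above; the stage-2 mixing ratio between short and long examples is not given in the report and is left unspecified here.

```python
SHORT_MAX = 32_768    # stage 1: short examples only
LONG_MAX = 262_144    # stage 2: short plus long examples, capped at 262144 tokens

def split_for_stages(examples, token_len):
    stage1 = [ex for ex in examples if token_len(ex) <= SHORT_MAX]
    stage2 = [ex for ex in examples if token_len(ex) <= LONG_MAX]
    return stage1, stage2
```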
In Conclusion
The development of large models is happening too rapidly; it’s hard to keep up with the papers. Let’s cherish this journey together.