Qwen Technical Report Details Sharing


Introduction

Alibaba open-sourced the Qwen-7B model a while ago, but for some reason it was later taken down. Just yesterday, Alibaba open-sourced the Qwen-14B model (the 7B model was also re-released) and published the Qwen technical report at the same time. Today I would like to share it with everyone.

PS: Now domestic open-source large models are also gradually releasing technical reports, let’s get involved!!!

Report: https://qianwen-res.oss-cn-beijing.aliyuncs.com/QWEN_TECHNICAL_REPORT.pdf
GitHub: https://github.com/QwenLM/Qwen

The technical report covers the entire Qwen series, including the Base models, RM models, Chat models, Code models, Math models, and multi-modal models. Since the Code and Math models are not yet open-sourced, and the multi-modal Qwen-VL model has its own paper, I will not cover those three models in this sharing. Interested readers can look them up themselves.

In short, the Qwen-14B model outperforms existing 13B-scale models across 12 datasets (covering language understanding, knowledge, reasoning, and other areas), but still lags behind GPT-3.5 and GPT-4.


Pre-training

Data

The pre-training data amounts to roughly 3 trillion tokens, drawn mainly from public web documents, encyclopedias, books, code, and so on. The data covers multiple languages, but primarily Chinese and English. To ensure data quality, a comprehensive preprocessing pipeline was established.

  • For web data, text is extracted from the HTML and language-identification tools are used to determine the language;
  • Deduplication is applied to increase data diversity, including exact-match deduplication after normalization and fuzzy deduplication based on MinHash and LSH algorithms (a sketch follows this list);
  • Low-quality data is filtered with a combination of rules and machine-learning methods, scoring content with multiple models, including language models, text-quality scoring models, and models that identify potentially offensive content;
  • Data from various sources is manually sampled and reviewed to ensure its quality;
  • Data from certain sources is selectively sampled to ensure the model is trained on a variety of high-quality content.
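
The report does not describe the deduplication implementation, but as a rough illustration, here is a minimal sketch of MinHash + LSH fuzzy deduplication using the datasketch library; the word-level shingling and the 0.8 similarity threshold are assumptions, not values from the report.

```python
from datasketch import MinHash, MinHashLSH  # common MinHash/LSH library

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    # Hash word-level shingles; the report does not specify the shingling scheme.
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

def fuzzy_dedup(docs: dict) -> list:
    # Keep a document only if no previously kept document is a near-duplicate.
    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # 0.8 is an assumed threshold
    kept = []
    for doc_id, text in docs.items():
        m = minhash_of(text)
        if lsh.query(m):  # near-duplicate of something already kept
            continue
        lsh.insert(doc_id, m)
        kept.append(doc_id)
    return kept

print(fuzzy_dedup({"a": "qwen is a large language model",
                   "b": "Qwen is a LARGE language model",
                   "c": "a completely different document"}))
```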

Tokenizer

Vocabulary size affects the model’s training efficiency and downstream task performance. Qwen uses the open-source fast BPE tokenizer tiktoken, starting from the cl100k base vocabulary, adding commonly used Chinese characters and words as well as vocabulary for other languages, and splitting numeric strings into single digits, for a final vocabulary size of roughly 152K.

Comparing the compression rates of different models across different languages, as shown in the figure below, Qwen outperforms LLaMA-7B, Baichuan-7B, ChatGLM-6B, and InternLM-7B models in most languages.

[Figure: tokenizer compression-rate comparison across languages for Qwen, LLaMA-7B, Baichuan-7B, ChatGLM-6B, and InternLM-7B]
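
To give a sense of how such a compression comparison can be measured, the sketch below computes UTF-8 bytes per token with Qwen's released tokenizer, loaded via transformers with trust_remote_code; the metric and the sample sentences are only illustrative and may differ from what the report actually measures.

```python
from transformers import AutoTokenizer

def bytes_per_token(tokenizer, text: str) -> float:
    # A simple compression proxy: UTF-8 bytes of the text divided by the number
    # of tokens it encodes into (higher = better compression for that text).
    n_tokens = len(tokenizer.encode(text))
    return len(text.encode("utf-8")) / max(1, n_tokens)

# Qwen's tokenizer ships with the model repository and needs trust_remote_code.
qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen-14B", trust_remote_code=True)

print("zh:", bytes_per_token(qwen_tok, "通义千问是阿里巴巴开源的大语言模型。"))
print("en:", bytes_per_token(qwen_tok, "Qwen is an open-source large language model from Alibaba."))
print("digits:", qwen_tok.tokenize("12345"))  # numeric strings are split into single digits
```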

PS: I don’t know why there was no comparison with the Baichuan2 model.

Model

The model uses the Transformer framework, with the following modifications:

  • Embedding and output projection: The embedding layer and the lm_head layer do not share weights; they are two separate weight matrices.
  • Positional embedding: RoPE is used for positional encoding, with the inverse frequency matrix kept in FP32 precision.
  • Bias: Biases are removed from most layers, but a bias is added to the attention QKV projections to enhance the model’s extrapolation capability.
  • Pre-Norm & RMSNorm: Pre-normalization is used to improve training stability, replacing traditional normalization methods with RMSNorm.
  • Activation function: SwiGLU is used. Unlike the traditional FFN with two weight matrices, SwiGLU has three, so the FFN hidden dimension is reduced from 4x to 8/3x the model dimension (a sketch follows this list).
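
As a minimal PyTorch sketch, a SwiGLU feed-forward block with the usual three-matrix (gate/up/down) layout looks like the following; the exact rounding of the 8/3 hidden size in Qwen may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    # Three weight matrices instead of the classic FFN's two; the hidden size is
    # reduced from 4*d to roughly (8/3)*d to keep the parameter count comparable.
    def __init__(self, d_model: int):
        super().__init__()
        hidden = int(8 * d_model / 3)
        self.gate_proj = nn.Linear(d_model, hidden, bias=False)
        self.up_proj = nn.Linear(d_model, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = down( SiLU(gate(x)) * up(x) )
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

print(SwiGLUFFN(1024)(torch.randn(2, 8, 1024)).shape)  # torch.Size([2, 8, 1024])
```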

Extrapolation Capability Expansion

The attention mechanism of the Transformer places significant limits on context length: computational cost and memory grow quadratically as the context gets longer. The Qwen models apply simple, training-free techniques to extend the context length at inference time.

  • Dynamic NTK-aware interpolation, which dynamically rescales the positional encoding as the sequence length grows (see the sketch after this list);
  • LogN-Scaling, which rescales the dot product of the query and key by a factor based on the ratio of context length to training length, keeping the entropy of the attention values stable as the context length grows;
  • Window attention, which restricts attention to a limited context window so the model does not attend to content that is too far away. Different layers use different window sizes, with lower layers using shorter windows and higher layers using longer windows.
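
The sketch below shows the general form of dynamic NTK-aware base rescaling and LogN attention scaling; the formulas follow the commonly used recipes for these techniques, and the exact constants Qwen uses may differ.

```python
import math

def dynamic_ntk_base(orig_base: float, head_dim: int, seq_len: int, train_len: int = 2048) -> float:
    # Dynamic NTK-aware scaling: once the inference context exceeds the training
    # length, enlarge the RoPE base so the positional frequencies are stretched.
    if seq_len <= train_len:
        return orig_base
    scale = seq_len / train_len
    return orig_base * scale ** (head_dim / (head_dim - 2))

def logn_scale(seq_len: int, train_len: int = 2048) -> float:
    # LogN-Scaling: rescale the query (and hence the q·k logits) by
    # log_{train_len}(seq_len) so attention entropy stays stable beyond the
    # training length; no rescaling is applied within the training length.
    return max(1.0, math.log(seq_len) / math.log(train_len))

print(dynamic_ntk_base(10000.0, head_dim=128, seq_len=8192))  # enlarged RoPE base
print(logn_scale(8192))                                       # ~1.18 query scale
```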

Training

  • Follows standard methods of autoregressive language modeling, predicting the next token based on the content of previous tokens;
  • The maximum length during model pre-training is 2048. To construct batch data, the text content is randomly shuffled and merged, and then truncated to the specified length.
  • The attention module uses Flash Attention technology to improve training speed;
  • The optimizer used is AdamW, with hyperparameters β1, β2, and ϵ set to 0.9, 0.95, and 1e−8, respectively;
  • A cosine learning-rate schedule is used, with the learning rate decaying to 10% of its peak (a configuration sketch follows this list);
  • BFloat16 is used for mixed precision training.
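
Putting the quoted settings together, a minimal PyTorch sketch of the optimizer and learning-rate schedule might look like this; the betas and epsilon match the values above, while the peak learning rate, warmup, and step counts are placeholders since they are not restated here.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)  # placeholder model for illustration

optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8)  # lr is illustrative

total_steps = 10_000   # illustrative
warmup_steps = 500     # illustrative

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay down to 10% of the peak learning rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)  # call scheduler.step() once per training step
```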

Pre-training Results

The Qwen models perform excellently compared with models of the same parameter scale, and even surpass larger models such as LLaMA2-70B on three tasks.


Alignment

Supervised Fine-tuning (SFT)

To strengthen the supervised fine-tuning dataset, dialogues in a variety of styles were annotated, focusing on natural-language generation for different tasks to further improve the model’s usefulness. The format of the training data also affects the model’s performance; Qwen adopts the ChatML-style format for fine-tuning. ChatML allows the model to effectively distinguish the different types of information in a conversation, including system prompts, user input, and model output, improving its ability to handle and analyze complex conversations.
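
For reference, here is a minimal sketch of the ChatML layout. Qwen's chat models delimit roles with <|im_start|> and <|im_end|> markers; the system prompt and question below are just examples.

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the Qwen technical report in one sentence."},
]

def to_chatml(messages: list) -> str:
    # Each message is wrapped as <|im_start|>{role}\n{content}<|im_end|>, and the
    # prompt ends with an opened assistant turn for the model to complete.
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml(messages))
```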

  • The optimizer used is AdamW, with hyperparameters β1, β2, and ϵ set to 0.9, 0.95, and 1e−8, respectively;
  • The maximum input length of the model is 2048;
  • The training batch size is 128;
  • The model is trained for 4000 steps, with the learning rate gradually increasing to a peak of 2e−6 during the first 1430 steps.
  • To prevent overfitting, the weight decay value is set to 0.1, dropout to 0.1, and the gradient clipping limit to 1.0.

RM Model

To build the reward model, a large amount of data is first used for preference model pre-training (PMP), and the reward model is then fine-tuned on high-quality preference data. The high-quality preference data is collected through a balanced sampling scheme over a classification system of 6,600 detailed tags, ensuring data diversity and complexity.

The reward model is initialized from a Qwen model of the same size with a pooling layer added on top; the value at a special end-of-sequence token is used as the reward.
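
A minimal sketch of what such a reward head can look like, scoring the hidden state at a special end-of-sequence token on top of a pretrained backbone; this is an illustration of the idea, not Qwen's actual implementation.

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    # Map the hidden state at a designated end token to a scalar reward.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, end_positions: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); end_positions: (batch,) index of the end token
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        end_hidden = hidden_states[batch_idx, end_positions]  # (batch, hidden)
        return self.score(end_hidden).squeeze(-1)             # (batch,) scalar rewards

head = RewardHead(hidden_size=1024)
rewards = head(torch.randn(4, 2048, 1024), torch.tensor([100, 512, 2047, 30]))
print(rewards.shape)  # torch.Size([4])
```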

During the training process, the learning rate is kept at 3e−6, the batch size is 64, and the maximum length is 2048, training for one epoch.


Reinforcement Learning (PPO)

The PPO stage involves four models: the policy model, value model, reference model, and reward model. During training, the value model is first updated on its own for 50 steps so that it can adapt effectively to the reward model. In the PPO process, two responses are sampled simultaneously for each query, the KL-divergence coefficient is set to 0.04, and rewards are normalized based on a running average.

The learning rates for the policy model and value model are set to 1e−6 and 5e−6, respectively. To enhance training stability, the clipping value is set to 0.15. During inference, the top-p value for the generation strategy is set to 0.9.
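
To make the quoted numbers concrete, here is a generic sketch of the clipped PPO policy loss and a KL-penalized reward, using the 0.15 clipping value and 0.04 KL coefficient mentioned above; this is a standard formulation, not Qwen's training code.

```python
import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_eps: float = 0.15):
    # Clipped PPO surrogate objective (clip_eps = 0.15 as quoted above).
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def kl_penalized_reward(reward, logprob_policy, logprob_ref, kl_coef: float = 0.04):
    # Reward shaped with a KL penalty toward the reference model (kl_coef = 0.04).
    return reward - kl_coef * (logprob_policy - logprob_ref)

lp, old_lp, adv = torch.randn(8), torch.randn(8), torch.randn(8)
print(ppo_policy_loss(lp, old_lp, adv))
print(kl_penalized_reward(torch.tensor(1.0), torch.tensor(-2.0), torch.tensor(-2.5)))
```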

Alignment Results

The performance of Qwen exceeds that of other open-source models of a similar scale, such as LLaMA2, ChatGLM2, InternLM, and Baichuan2.

Additionally, a test dataset covering a wide range of topics was constructed for human evaluation, comparing Qwen-7B-Chat (SFT), Qwen-14B-Chat (SFT), Qwen-14B-Chat (RLHF), and GPT-4 against GPT-3.5 in dialogue. The RLHF model clearly outperforms the SFT models, indicating that RLHF produces responses that humans prefer.

Tool Usage

The Qwen model has tool usage capabilities:

  • It can use unseen tools via ReAct prompting;
  • It enhances mathematical reasoning and data analysis capabilities using a Python interpreter;
  • As an agent, it can access a large collection of multi-modal models in HuggingFace during interactions with humans.

PS: The high-quality data includes 2,000 ReAct-format examples.

How to use ReAct prompting to instruct Qwen to call tools:
https://github.com/QwenLM/Qwen/blob/main/examples/react_prompt.md
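
To get a quick sense of the pattern, here is a generic ReAct-style prompt scaffold; the tool definition is hypothetical, and Qwen's exact template is documented in the link above.

```python
TOOLS = [{
    "name": "search",  # hypothetical tool, for illustration only
    "description": "Search the web and return a short snippet.",
    "parameters": '{"query": "string"}',
}]

def build_react_prompt(question: str, tools: list = TOOLS) -> str:
    # Generic ReAct scaffold: the model interleaves Thought / Action / Action Input,
    # the caller runs the tool and appends an Observation, until a Final Answer.
    tool_desc = "\n".join(
        f"{t['name']}: {t['description']} Parameters: {t['parameters']}" for t in tools
    )
    return (
        "Answer the following question. You have access to these tools:\n"
        f"{tool_desc}\n\n"
        "Use the following format:\n"
        "Question: the input question\n"
        "Thought: reasoning about what to do next\n"
        "Action: the tool to use\n"
        "Action Input: the tool arguments\n"
        "Observation: the tool result\n"
        "... (Thought/Action/Action Input/Observation can repeat)\n"
        "Final Answer: the answer to the question\n\n"
        f"Question: {question}\n"
    )

print(build_react_prompt("What is the parameter count of Qwen-14B?"))
```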

Conclusion

Large models are no longer competing only on open-sourcing weights; they are starting to compete on technical reports too~
