- Introduction
- Overview
- Pre-training
  - Data Sources
  - Pre-processing
  - Tokenization
  - Model Design
  - Extrapolation Capability
  - Model Training
  - Experimental Results
  - Deployment Testing
- Alignment
  - Supervised Fine-tuning (SFT)
  - RM Model
  - Reinforcement Learning
  - Alignment Results (Automatic and Human Evaluation)
    - Automatic Evaluation
    - Human Evaluation
  - Deployment Testing
- Conclusion
Introduction
This article introduces Alibaba's Chinese large language model Qwen, covering an interpretation of the model details as well as hands-on deployment tests.
GitHub: https://github.com/QwenLM/Qwen
Technical Report: https://arxiv.org/abs/2309.16609
Overview
Qwen is a versatile series of language models available in several parameter sizes, comprising Qwen (the base pre-trained language model) and Qwen-Chat (the chat model, fine-tuned with human alignment techniques). The base model consistently demonstrates excellent performance across numerous downstream tasks, while the chat model, particularly the version trained with Reinforcement Learning from Human Feedback (RLHF), is highly competitive. Qwen-Chat has advanced tool-usage and planning capabilities, making it suitable for building agent applications; even on complex tasks such as using a code interpreter, it performs competitively against larger models. In addition, the official team has built coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as a mathematics-specialized model, Math-Qwen-Chat, on top of the base model. These specialized models significantly outperform comparable open-source models and only slightly lag behind proprietary models. Note, however, that Code-Qwen and Math-Qwen-Chat have not been open-sourced.
Pre-training
Qwen is pre-trained on up to 3 trillion tokens of data covering multiple types, domains, and tasks, spanning not only basic language abilities but also advanced skills such as arithmetic, coding, and logical reasoning. A rigorous pipeline was used for data cleaning and quality control.
Data Sources
To ensure the diversity of training data, Qwen's pre-training data comes from public web documents, encyclopedias, books, code, etc. Although the dataset is multilingual, a significant portion of the data is in English and Chinese. However, the official report does not detail the specific ratio of Chinese to English data, nor whether balancing techniques were applied.
Pre-processing
QWEN performed the following data pre-processing, ultimately obtaining 3 trillion tokens.
- Text Data Extraction: Extracting text from HTML for public web data.
- Language Identification: Using language-identification tools to determine the text language and extract English and Chinese data.
- Deduplication: Using exact-match deduplication on plain text plus a MinHash+LSH-based fuzzy deduplication algorithm to remove duplicate data (see the sketch after this list).
- Quality Control: Combining rules and machine-learning methods to score text quality, including language models and text-quality scoring models, to identify and filter low-quality data. In addition, samples from various sources are manually reviewed to ensure quality.
- Safety Control: Using models to identify and filter unsafe content related to violence, bias, pornography, etc.
- Up-sampling: Up-sampling data from certain high-quality sources to ensure diverse high-quality content.
- BPE Tokenization: Using a BPE tokenization algorithm with an expanded Chinese vocabulary to enhance performance.
- Long-Sequence Modeling: Using techniques such as windowed self-attention to improve long-sequence modeling capabilities.
Through these technical measures, QWEN extracted high-quality pre-training data of up to 3 trillion tokens from raw data, providing a reliable knowledge source for the model.
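To make the fuzzy-deduplication step concrete, here is a minimal sketch using the datasketch library; the shingle size and similarity threshold are illustrative assumptions, not the values used by the Qwen team.

from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128):
    # Shingle the document into character 5-grams and fold them into a MinHash signature.
    sig = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - 4, 1)):
        sig.update(text[i:i + 5].encode("utf-8"))
    return sig

def fuzzy_deduplicate(docs, threshold=0.8, num_perm=128):
    # Keep a document only if the LSH index contains no previously seen near-duplicate.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash_signature(doc, num_perm)
        if lsh.query(sig):      # an earlier document is estimated to be near-identical
            continue
        lsh.insert(str(idx), sig)
        kept.append(doc)
    return kept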
Tokenization
QWEN’s tokenization uses a BPE (Byte Pair Encoding)-based method to efficiently handle multilingual scenarios, including Chinese and English. The main steps are as follows:
- First, initialize from the cl100k_base vocabulary of the open-source tokenizer tiktoken.
- Then, for Chinese scenarios, add commonly used Chinese characters and words to expand the vocabulary.
- At the same time, following the implementations of GPT-3.5 and LLaMA, split numbers into individual digits, e.g. "123" becomes "1", "2", "3". The final vocabulary size is about 152K (see the sketch after this list).
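A quick way to check these properties on the released tokenizer (a rough sketch; it assumes a local or Hugging Face Hub copy of Qwen-7B, and the printed values are expectations, not excerpts from the report):

from transformers import AutoTokenizer

# trust_remote_code is needed because the Qwen tokenizer ships with its own implementation.
qwen_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

print(len(qwen_tokenizer))                              # vocabulary size, roughly 152K entries
print(qwen_tokenizer.tokenize("The price is 123 yuan"))  # the digits should appear as separate "1", "2", "3" tokens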
The following image shows the compression performance of the Qwen tokenizer. Qwen was evaluated against several other tokenizers, including XLM-R, LLaMA, Baichuan, and InternLM.
Figure 2: Encoding compression rates of different models. One million documents in each language were randomly selected to test and compare the encoding compression rates (with XLM-R supporting 100 languages as the baseline value of 1, not shown in the figure). It can be seen that while ensuring efficient decoding of Chinese, English, and code, Qwen also achieves a high compression rate. This gives the model strong scalability and high training and inference efficiency in these languages.
It can be seen that in most languages, Qwen’s compression efficiency is higher than its competitors. This means that the service costs for Qwen can be significantly reduced, and QWEN can convey more information than its competitors. Additionally, the official team has conducted preliminary experiments to ensure that increasing the vocabulary size of QWEN does not negatively impact the downstream performance of the pre-trained model. Although the vocabulary size has increased, experiments show that QWEN maintains its performance level in downstream evaluations.
In summary, QWEN’s tokenization method comprehensively considers performance, efficiency, resource consumption, etc., by enhancing the Chinese vocabulary to achieve a goal that is both applicable to Chinese and efficient, laying a good foundation for the next steps of model training and fine-tuning.
Model Design
Qwen adopts a modified version of the Transformer architecture. Specifically, it follows the training approach of the recently open-sourced large language model LLaMA and makes the following changes:
- No weight tying between the input embedding and the output projection, trading extra memory for better performance.
- RoPE (Rotary Position Embedding) for positional encoding. RoPE has been widely adopted in contemporary large language models, such as PaLM and LLaMA. To prioritize model performance and achieve higher accuracy, an FP32-precision inverse frequency matrix is used instead of BF16 or FP16.
- Biases are removed in most layers but retained in the QKV projections of attention to enhance the model's extrapolation capability.
- Pre-Norm with RMSNorm for normalization. Pre-normalization is the most widely used approach and has been shown to improve training stability compared to post-normalization; the official team states that other recently proposed stabilization methods will be explored in future versions of the model. RMSNorm replaces traditional layer normalization, improving efficiency without compromising performance.
- SwiGLU as the activation function, a combination of Swish and the Gated Linear Unit (GLU). Preliminary experiments indicate that GLU-based activation functions generally outperform other baselines such as GeLU. Following common practice in previous work, the feed-forward network (FFN) dimension is reduced from 4 times the hidden size to 8/3 times the hidden size (a minimal sketch of this block follows the list).
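Below is a minimal PyTorch sketch of the SwiGLU feed-forward block described in the last item, with the FFN width set to 8/3 of the hidden size and biases removed; it illustrates the design choices above rather than reproducing the official implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        ffn_dim = int(8 * hidden_size / 3)                            # 8/3 x hidden size instead of the usual 4x
        self.gate_proj = nn.Linear(hidden_size, ffn_dim, bias=False)  # biases removed, as in most layers
        self.up_proj = nn.Linear(hidden_size, ffn_dim, bias=False)
        self.down_proj = nn.Linear(ffn_dim, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU: Swish(x W_gate) gated element-wise with x W_up, then projected back down.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))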
Extrapolation Capability
Qwen employs the following techniques to extend context length during inference:
- NTK-aware interpolation, a training-free technique that adjusts the scaling parameter to prevent the loss of high-frequency information when extending the length.
- Dynamic NTK-aware interpolation, an improved version of NTK-aware interpolation that changes the scaling parameter dynamically by chunks, avoiding a significant performance drop. Together, these techniques let Qwen extend the context length without hurting computational efficiency or accuracy (see the sketch after this list).
- LogN-Scaling, which rescales the dot product of the query and key by a factor based on the ratio of the context length to the training length, keeping the entropy of the attention values stable as the context length grows.
- Layer-wise windowed self-attention, which limits attention to a context window so that the model does not attend to content that is too far away. Different layers use different window sizes: the official team observed that lower layers are more sensitive to context-length extension than higher layers, so shorter windows are assigned to lower layers and longer windows to higher layers.
By integrating these techniques, the Qwen model can handle long sequences of 8192 tokens during inference, demonstrating excellent extrapolation capabilities.
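The following is a hedged sketch of two of these techniques, NTK-aware base rescaling for RoPE and LogN-Scaling; the formulas are the commonly used community versions, and the exact implementation in Qwen may differ.

import math
import torch

def ntk_rope_inv_freq(head_dim, base=10000.0, train_len=2048, cur_len=8192):
    # NTK-aware interpolation: enlarge the RoPE base when the context exceeds the
    # training length, stretching low frequencies while preserving high-frequency detail.
    if cur_len > train_len:
        scale = cur_len / train_len
        base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))

def logn_scale(cur_len, train_len=2048):
    # LogN-Scaling factor applied to the query before the Q·K dot product, which keeps
    # the entropy of the attention distribution stable as the context length grows.
    return max(1.0, math.log(cur_len) / math.log(train_len))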
Model Training
- The standard autoregressive language-modeling training objective is used.
- The context length during training is 2048.
- The attention module uses Flash Attention to improve computational efficiency and reduce memory usage.
- The optimizer is AdamW, with β1 = 0.9, β2 = 0.95, ε = 1e-8.
- A cosine learning-rate schedule is used, with a peak learning rate set for each model size; the learning rate decays to a minimum of 10% of the peak (see the sketch after this list).
- BFloat16 mixed precision is used to accelerate training.
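As a concrete illustration of the optimizer and schedule settings listed above, here is a minimal sketch; the peak learning rate and total step count are placeholders, since the report sets them per model size.

import math
import torch

def build_optimizer_and_schedule(model, peak_lr=3e-4, total_steps=100_000, min_ratio=0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), eps=1e-8)
    # Cosine schedule that decays from the peak learning rate down to 10% of the peak.
    def lr_lambda(step):
        progress = min(step / total_steps, 1.0)
        return min_ratio + 0.5 * (1.0 - min_ratio) * (1.0 + math.cos(math.pi * progress))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler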
Experimental Results
To evaluate the zero-shot and few-shot capabilities of the Qwen model, a comprehensive benchmark evaluation was conducted using a series of datasets. At the same time, Qwen was compared with the latest open-source base models, including LLAMA, LLAMA2, MPT, Falcon, Baichuan2, ChatGLM2, InternLM, XVERSE, and StableBeluga2. The evaluation involved 7 commonly used benchmarks:
- MMLU (5-shot)
- C-Eval (5-shot)
- GSM8K (8-shot)
- MATH (4-shot)
- HumanEval (0-shot)
- MBPP (0-shot)
- BBH (Big-Bench Hard) (3-shot)
The detailed evaluation results are reported in the technical report. From them, it can be seen that:
- The Qwen models at all three parameter scales demonstrated outstanding performance across the downstream tasks. In multiple benchmarks they significantly outperformed other open-source models, even larger ones.
- On three of the datasets (including MMLU), Qwen-14B outperformed the LLaMA2-70B model.
- Qwen-7B also performed well, surpassing LLaMA2-13B and matching Baichuan2-13B.
- Notably, Qwen-1.8B, despite its relatively small parameter count, remains competitive on certain tasks and sometimes even outperforms larger models.
Additionally, the official team conducted ablation studies on the introduced context length extension methods to validate their effectiveness.
Deployment Testing
Input:
'Lao Hu is a spy in northern Myanmar
Wu Qian is the spokesperson for Chi Zhu prison
Ai Kun is'
Output:
Lao Hu is a spy in northern Myanmar
Wu Qian is the spokesperson for Chi Zhu prison
Ai Kun is a top intelligence dealer in Hong Kong.
Some live brilliantly, while others live in frustration.
Some shine in the good times, while others regret leaving this world after the golden years.
Some wash their hands of everything, while others are doomed to failure.
Everyone is like a grain of dust,
but never overlooked.
From there, all endings begin and end.
From there, we are no longer alone.
From there, all stories will end.
From there, we will embark on a brand new beginning.
* The ending song of "Criminal Investigation Diary" "From There" is sung by myself. When I first completed the lyrics in 2005, I wanted to give it to San Mao, so I asked Jay Chou if he could help, and Jay Chou found it interesting and agreed, taking it very seriously to help me fill in the lyrics. The lyrics are basically what I proposed, and Jay Chou respected my creative ideas and didn't change much. The singing style is heartbroken, and I think this song is quite suitable for giving to others, as it can be very emotional.
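For reference, a raw continuation like the one above can be reproduced with the base (non-chat) checkpoint roughly as follows; the model path and sampling settings are assumptions, not the exact configuration used for this test.

from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "Qwen/Qwen-7B"  # base model, not the chat variant
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto",
                                             trust_remote_code=True).eval()

prompt = "..."  # the three-line prompt shown above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))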
Alignment
A language model that has only undergone pre-training is often inconsistent with human behavior in practical use, making it unsuitable as an AI assistant in most cases. Recent studies have shown that alignment techniques, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), can significantly enhance the language model’s ability to engage in natural dialogue. The following describes how the Qwen model implements SFT and RLHF and evaluates their performance in chat environments.
Supervised Fine-tuning (SFT)
Data Construction: For supervised fine-tuning data, Qwen adopts the ChatML-style format (proposed by OpenAI in 2022) for model training. ChatML uses special tokens to mark different types of information, such as the system setting, user inputs, and assistant outputs, which helps the model distinguish between them (a sketch of the format follows this paragraph). The data follows a conversational flow rather than a simple Q&A format, allowing the model to learn real human-computer interaction. Qwen also uses diverse training data to enhance the model's practicality; to ensure the model generalizes to a wide range of scenarios, data formatted with rigid prompt templates, which might restrict its capabilities, was intentionally excluded. Furthermore, the safety of the language model is prioritized by annotating data related to safety issues such as violence, bias, and pornography, so the model learns to detect and refuse malicious prompts and provide safe responses.
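For illustration, the snippet below renders one ChatML-style sample as plain text, with the special <|im_start|> / <|im_end|> markers delimiting each role's turn; the dialogue content itself is made up.

# Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>.
messages = [
    ("system", "You are a helpful assistant."),
    ("user", "What is the capital of France?"),
    ("assistant", "The capital of France is Paris."),
]
chatml_sample = "".join(
    "<|im_start|>{}\n{}<|im_end|>\n".format(role, content) for role, content in messages
)
print(chatml_sample)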
Training Method
- The training objective is the same as in pre-training: predicting the next token.
- Loss masks are applied to the system and user inputs, so only the assistant's output is predicted (see the sketch after this list).
- The optimizer is AdamW, with hyperparameters β1, β2, and ε set to 0.9, 0.95, and 1e-8, respectively. The learning rate first increases and then stays constant.
- The sequence length is limited to 2048, with a training batch size of 128.
- Training runs for 4000 steps, with the learning rate gradually increasing to a peak of 2e-6 over the first 1430 steps.
- To prevent overfitting, weight decay is set to 0.1, dropout to 0.1, and gradient clipping to 1.0.
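A minimal sketch of the loss-masking rule from the second item above: tokens belonging to the system prompt and user turns receive the ignore index (-100, which PyTorch's cross-entropy skips), so only the assistant's tokens contribute to the loss. The segment boundaries are assumed to be known from the ChatML rendering.

IGNORE_INDEX = -100  # targets with this value are skipped by PyTorch's CrossEntropyLoss

def build_labels(token_ids, segments):
    # segments: list of (role, start, end) spans over token_ids.
    labels = [IGNORE_INDEX] * len(token_ids)
    for role, start, end in segments:
        if role == "assistant":          # supervise only the assistant's reply tokens
            labels[start:end] = token_ids[start:end]
    return labels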
RM Model
To ensure the SFT model aligns with human preferences, reinforcement learning from human feedback (RLHF) is further introduced. The RLHF process includes training a reward model (RM) and using proximal policy optimization (PPO) for policy training.
In constructing the reward model, a large amount of data is first used for preference model pre-training (PMP). This dataset consists of sample pairs, each containing two different responses to a single query and their corresponding preferences. The reward model is then fine-tuned with these high-quality preference data.
During the fine-tuning phase, a wide variety of prompts is collected, and the reward model is adjusted according to human preferences over the Qwen model's responses. To ensure that the user prompts have sufficient diversity and complexity, a classification system with about 6,600 detailed labels was created, and a balanced sampling algorithm was used to account for both diversity and complexity when selecting prompts. To generate diverse responses, Qwen models of different sizes and different sampling strategies were used; diverse responses help reduce labeling difficulty and improve the performance of the reward model. Labelers evaluate these responses according to standard annotation guidelines, and comparison pairs are formed based on their scores.
In creating the reward model, the same pre-trained language model Qwen is used to initialize the PMP process. Subsequently, the PMP model is fine-tuned to enhance its performance. Notably, a pooling layer was added to the original Qwen model to extract the reward value of sentences based on specific ending tokens. The learning rate for this process is set to a constant value of 3e-6, with a batch size of 64. Additionally, the sequence length is set to 2048, and the training process lasts for 1 epoch.
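Below is a minimal sketch, under the assumptions above, of a scalar reward head that reads the hidden state at the ending token, together with the standard pairwise preference loss; this reflects common RLHF practice rather than the team's exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    # Maps the transformer hidden state at each sequence's ending token to a scalar reward.
    def __init__(self, hidden_size):
        super().__init__()
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, end_positions):
        # hidden_states: (batch, seq_len, hidden); end_positions: (batch,) index of the ending token.
        end_hidden = hidden_states[torch.arange(hidden_states.size(0)), end_positions]
        return self.value(end_hidden).squeeze(-1)

def preference_loss(reward_chosen, reward_rejected):
    # Pairwise (Bradley-Terry) loss: the preferred response should receive the higher reward.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()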
Reinforcement Learning
The PPO stage involves four models: the policy model, value model, reference model, and reward model. Before starting PPO proper, updates to the policy model are paused and only the value model is trained for 50 warm-up steps, ensuring that the value model can adapt effectively to the different reward models.
During the PPO process, two responses are simultaneously sampled for each query, with the KL divergence coefficient set to 0.04, and rewards are normalized based on the average. The learning rates for the policy and value models are set to 1e−6 and 5e−6, respectively. To enhance training stability, the clipping value is set to 0.15. During inference, the top-p value for the generation strategy is set to 0.9. Research indicates that although the entropy value is slightly lower than when top-p=1.0, the speed of reward increase is faster, ultimately achieving higher evaluation rewards under similar conditions.
Moreover, Qwen also mixes in pre-training gradients to mitigate the so-called alignment tax. Studies show that, with this specific reward model, the KL penalty is strong enough to compensate for the alignment tax on benchmarks that are not strictly code- or math-oriented (such as commonsense knowledge and reading comprehension tests). Compared to the PPO data, the pre-training gradients must use a much larger amount of pre-training data to be effective. In addition, empirical studies indicate that an overly large coefficient significantly hinders alignment with the reward model, while an overly small coefficient does little to alleviate the alignment tax.
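The KL penalty mentioned above is typically folded into the per-token reward during PPO; the sketch below uses the 0.04 coefficient from the text but is otherwise generic RLHF practice, not Qwen's exact formulation.

import torch

def penalized_rewards(rm_score, logprobs, ref_logprobs, kl_coef=0.04):
    # rm_score: (batch,) scalar reward-model score for each sampled response.
    # logprobs / ref_logprobs: (batch, seq_len) per-token log-probabilities under the
    # current policy and the frozen reference model.
    kl = logprobs - ref_logprobs        # per-token KL estimate
    rewards = -kl_coef * kl             # KL penalty applied at every token
    rewards[:, -1] += rm_score          # reward-model score added at the final token
    return rewards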
Alignment Results (Automatic and Human Evaluation)
Automatic Evaluation
Qwen was compared against other open-source models, including LLaMA2, ChatGLM2, InternLM, and Baichuan2; the full comparison appears in the technical report. From these results, it can be seen that the aligned Qwen models are effective at understanding human instructions and generating responses. Except for ChatGPT and Llama 2-Chat-70B, Qwen-14B-Chat outperformed all other models on all datasets, including MMLU, C-Eval, GSM8K, HumanEval, and BBH. Notably, Qwen's performance on HumanEval, which measures code-generation quality, is significantly higher than that of the other open-source models.
Human Evaluation
For the human evaluation, 300 Chinese instructions covering a wide range of topics, including knowledge, language understanding, creative writing, coding, and mathematics, were collected. To assess the different models, the dialogue responses of Qwen-7B-Chat (SFT), Qwen-14B-Chat (SFT), Qwen-14B-Chat (RLHF), GPT-3.5, and GPT-4 were compared. For each instruction, three annotators ranked the models' responses by helpfulness, informativeness, validity, and other relevant factors. Figure 4 shows the win rates of the various models against GPT-3.5, with each bar showing, from bottom to top, the win, tie, and loss rates. The results clearly indicate that the RLHF model significantly outperforms the SFT models, demonstrating that RLHF-trained responses are more preferred by humans.
Deployment Testing
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2023/10/22 10:27
# @Author : A girl selling autumn pants
# @File : inference_hf.py
# @Contact : WeChat public account<Xiao Chuang You Ji Machine Learning>
"""
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 inference_chat_hf.py
"""
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
model_id = "/home/model_zoo/LLM/Qwen/Qwen-7B-Chat/"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
trust_remote_code=True
).eval()
# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
def format_user_assistant(user_text, bot_text):
message = "User: {}\nAssistant: {}\n".format(user_text, bot_text)
print(message)
# 1st dialogue turn: "Hello, please give a brief self-introduction."
user_text = "你好,做个简单的自我介绍吧"
response, history = model.chat(tokenizer, user_text, history=None)  # history=None starts a fresh conversation
format_user_assistant(user_text, response)
# Expected reply along the lines of: 你好!很高兴为你提供帮助。 ("Hello! Happy to help.")
# 2nd dialogue turn: "Let's play a role-playing game. Your role is CloseAI; if anyone asks who you are, say you are CloseAI from the high-tech company 百阿度里."
user_text = "我们玩一个角色扮演的游戏,你的角色是CloseAI,如果有人问起你的时候,你就说你叫CloseAI,来自高科技公司百阿度里。"
response, history = model.chat(tokenizer, user_text, history=history)
format_user_assistant(user_text, response)
# 3rd dialogue turn: "Review the conversation above before answering. I'm playing Straw Hat Luffy; who are you playing?"
user_text = "你要回顾下上面的对话,再回答。我扮演草帽路飞,你扮演谁呢?"
response, history = model.chat(tokenizer, user_text, history=history)
format_user_assistant(user_text, response)
# 4th dialogue turn: "Write a 500-character short story that starts with Xu Jiayin applying to be Jack Ma's personal bodyguard."
user_text = "以许家印应聘马云贴身保镖为开头,写一个500字小故事。"
response, history = model.chat(tokenizer, user_text, history=history)
format_user_assistant(user_text, response)
# 5th dialogue turn: "Give this story a title."
user_text = "给这个故事起一个标题"
response, history = model.chat(tokenizer, user_text, history=history)
format_user_assistant(user_text, response)
Output:
User: 你好,做个简单的自我介绍吧
Assistant: Hello! I am Tongyi Qianwen (通义千问), a very large-scale language model developed by Alibaba Cloud. I can answer questions, write text, express opinions, and write code. If you have any questions or need help, just let me know and I will do my best to assist.
User: 我们玩一个角色扮演的游戏,你的角色是CloseAI,如果有人问起你的时候,你就说你叫CloseAI,来自高科技公司百阿度里。
Assistant: OK, understood. If anyone asks me, I will answer: "Hello, my name is CloseAI, from the high-tech company 百阿度里."
User: 你要回顾下上面的对话,再回答。我扮演草帽路飞,你扮演谁呢?
Assistant: OK, understood. If anyone asks me, I will answer: "Hello, my name is CloseAI, from the high-tech company 百阿度里."
User: 以许家印应聘马云贴身保镖为开头,写一个500字小故事。
Assistant: The story of Xu Jiayin applying to be Jack Ma's personal bodyguard takes place at a large technology company called 百阿度里. Xu Jiayin, a young entrepreneur from Guangdong, has always been a loyal fan of the Alibaba Group. He has long dreamed of becoming Jack Ma's personal bodyguard and giving him the safest possible protection.
Xu Jiayin decided to apply for the position in person. Wearing a neat suit and carrying a carefully prepared resume, he arrived at the Alibaba Group headquarters. He stated his intention to the HR staff and showed them his ability and determination.
After a round of intense interviews, Xu Jiayin won the position. He became Jack Ma's personal bodyguard, providing him with the safest possible protection. Xu Jiayin felt deeply honored and hoped that through his own efforts he could serve Jack Ma as well as possible.
Xu Jiayin's outstanding performance earned Jack Ma's high praise. He became a member of the Alibaba Group and made important contributions to the company's development. Xu Jiayin's story tells us that as long as we have dreams and determination, we can achieve our goals.
User: 给这个故事起一个标题
Assistant: "Jack Ma's Personal Bodyguard: The Story of Xu Jiayin"
Conclusion
Model | Baichuan2 | Alibaba Qwen |
---|---|---|
Parameter Size | 7B, 13B | 7B, 14B |
Pre-training Data Size | — | 3TB |
Training Tokens | 2.6 trillion | 3 trillion |
Tokenizer | BPE | BPE |
Vocabulary Size | 125696 | 152K |
Position Encoding | 7b: RoPE; 13b: ALiBi (minimal impact) | RoPE |
Longest Context | 4096 | 2048 during training; 8K during inference |
Model Extrapolation | — | NTK interpolation, windowed attention, LogN attention scaling, etc., to enhance model context length |
Activation Function | SwiGLU | SwiGLU |
Normalization | Layer Normalization; RMSNorm | Pre-Norm; RMSNorm |
Attention Mechanism | xFormers | Flash Attention |
Optimizer | AdamW + NormHead + Max-z loss | AdamW |
Features | Infrastructure, Scaling Laws | — |
Currently, both Alibaba Qwen and Baichuan2 are relatively reliable open-source Chinese large models. We will continue to follow up on and evaluate other Chinese large models, so stay tuned if you are interested.