Introduction
The moon of ancient times is unseen by people of today, yet this very moon once shone upon the ancients. Hello everyone, I am the little girl selling hot dry noodles, and I am glad to share cutting-edge technology and ideas in the field of artificial intelligence with you.
With the rapid development of Large Language Models (LLMs), the expansion of model and data scales, combined with techniques such as pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF), the capabilities of LLMs in language understanding, generation, and reasoning tasks have significantly improved. At the same time, the open-source community has seen a surge of open-weight LLMs, such as the Llama series, Mistral series, ChatGLM series, DeepSeek series, and Qwen series. These open-source models have lowered the barriers to use and accelerated the development of AI applications through community collaboration.
Among these open-source models, Llama and Qwen are undoubtedly in the first tier of the open-source large model arena. With the continuous iteration of the Qwen models, Qwen2.5-72B now performs comparably to Llama-3.1-405B, making Qwen one of the most promising model families in the open-source domain.
Qwen (short for Tongyi Qianwen, meaning "thousand questions") is a series of large language models developed and open-sourced by Alibaba, first released in August 2023. Through continuous iteration, the Qwen series has launched Qwen1.5, Qwen2, and Qwen2.5, as well as Qwen-VL and Qwen2-VL, which focus on visual language tasks. By optimizing the model architecture, expanding the training data, and introducing advanced training methods, Qwen has demonstrated exceptional capabilities in language understanding, reasoning, code generation, and multimodal tasks, driving the development of the open-source LLM community.
This article and subsequent ones in the series will delve into the evolution of Qwen, its training data, model architecture, training methods, and various technical details. This is the first article: Overview of Qwen Series Technology 1 – The Evolution of Qwen.
Subsequent articles in the series will be published in succession:
Overview of Qwen Series Technology 2 – Qwen’s Data Processing
Overview of Qwen Series Technology 3 – Qwen’s Model Architecture
Overview of Qwen Series Technology 4 – Qwen’s Training Methods
Overview of Qwen Open-source Models
Let's take a look at the Qwen model series.
The Qwen models are based on a decoder-only Transformer architecture, optimized with techniques such as RoPE positional embeddings and the SwiGLU activation function. The Qwen series of open-source models covers a variety of task domains, including general language models, visual language models, audio language models, code generation models, and mathematical reasoning models. Here are the main model categories and their scales (a short sketch after this overview shows how to inspect these architectural settings from a model's configuration):
- Tongyi Qianwen (Qwen): general language models
  - [Qwen]: 1.8B, 7B, 14B, and 72B models
  - [Qwen1.5]: 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B models, plus an MoE model
  - [Qwen2]: 0.5B, 1.5B, 7B, 57B-A14B (MoE), and 72B models
  - [Qwen2.5]: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B models
- Tongyi Qianwen VL (Qwen-VL): visual language models
  - [Qwen-VL]: based on the 7B model
  - [Qwen2-VL]: 2B, 7B, and 72B models
- Tongyi Qianwen Audio (Qwen-Audio): audio language models
  - [Qwen-Audio]: based on the 7B model
  - [Qwen2-Audio]: based on the 7B model
- Tongyi Qianwen Coder (CodeQwen / Qwen-Coder): code language models
  - [CodeQwen1.5]: 7B model
  - [Qwen2.5-Coder]: 1.5B and 7B models, with 32B to follow
- Tongyi Qianwen Math (Qwen-Math): mathematical language models
  - [Qwen2-Math]: 1.5B, 7B, and 72B models
  - [Qwen2.5-Math]: 1.5B, 7B, and 72B models
In addition, the Qwen series also provides closed-source models, including two MoE variants: Qwen2.5-Turbo and Qwen2.5-Plus, which users can experience through Alibaba Cloud Model Studio.
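As a quick way to connect the architecture notes above to something concrete, the following minimal sketch reads a Qwen checkpoint's configuration with the transformers library and prints the settings mentioned earlier. It assumes transformers is installed and the Hugging Face Hub is reachable; Qwen/Qwen2.5-7B-Instruct is used only as an example:
from transformers import AutoConfig

# Load only the configuration; no model weights are downloaded.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

print(config.model_type)               # "qwen2": a decoder-only Transformer
print(config.hidden_size)              # hidden layer width
print(config.num_hidden_layers)        # number of decoder layers
print(config.num_attention_heads)      # number of attention heads
print(config.hidden_act)               # "silu", the gate activation used in SwiGLU
print(config.rope_theta)               # RoPE base frequency
print(config.max_position_embeddings)  # native context length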
The Evolution of Qwen
Qwen1 Series
Qwen1 marks the beginning of the Qwen series. It includes not only base models but also Chat models optimized through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The model scales cover 1.8B, 7B, and 14B, and the specialized MATH-QWEN-CHAT model performs exceptionally well on mathematical reasoning tasks, with results close to GPT-3.5.
- Training Data: A high-quality dataset containing trillions of tokens was constructed, covering web documents, encyclopedias, books, code, and other multi-domain data, with deduplication, filtering, and instruction-data integration as preprocessing steps. The multilingual data totals roughly 3T (3 trillion) tokens, primarily in Chinese and English.
- Model Scale Configuration: Provides multiple scales such as 1.8B, 7B, and 14B, with different hyperparameters for hidden size, number of attention heads, and number of layers.
- Context Length: Supports a context length of 2048 tokens.
- Qwen-Chat: Optimized through SFT and RLHF, it supports chat, text generation, summarization, translation, code generation, and mathematical reasoning.
- Performance: Shows outstanding results on benchmarks such as MMLU, C-Eval, and GSM8K, surpassing open-source models of the same scale and approaching the level of GPT-3.5 and GPT-4.
- Specialized Models: The coding and mathematical reasoning models CODE-QWEN and MATH-QWEN-CHAT also perform well in their respective fields, achieving solid results on code generation and mathematical problem-solving benchmarks.
- Tool Invocation Capability: Optimized for integration with external systems, with strong tool invocation capabilities (a short sketch follows this list).
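To make tool invocation concrete, here is a minimal, hedged sketch using the OpenAI-compatible function-calling interface that current Qwen chat services expose (the original Qwen1 relied on ReAct-style prompting instead). The endpoint, the model name, and the get_current_weather tool below are illustrative assumptions:
from openai import OpenAI
import os

# Assumed OpenAI-compatible endpoint; the API key is read from an environment variable.
client = OpenAI(
    api_key=os.getenv("YOUR_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

# A hypothetical tool the model may choose to call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a given city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "City name"}},
                "required": ["city"],
            },
        },
    }
]

completion = client.chat.completions.create(
    model="qwen-plus-latest",  # illustrative model name
    messages=[{"role": "user", "content": "What is the weather like in Wuhan right now?"}],
    tools=tools,
)

# If the model decides to call the tool, the structured call shows up here.
print(completion.choices[0].message.tool_calls)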
Qwen1.5 Series
The Qwen1.5 series further expands the model lineup, providing a range of Base and Chat models from 0.5B to 72B and introducing an MoE (Mixture of Experts) model. Additionally, Qwen1.5 supports various quantized models (Int4/Int8 GPTQ, AWQ, and GGUF) and is deeply integrated with frameworks such as vLLM, SGLang, and AutoAWQ, making deployment and fine-tuning easier.
- Dataset: No changes were made to the dataset.
- Model Scale Configuration: Qwen1.5 includes Base and Chat models at six scales (0.5B, 1.8B, 4B, 7B, 14B, and 72B) as well as one MoE model. Quantized versions of each size have also been released.
- Context Length: All models support a context length of 32,768 tokens.
- Quantized Models: In addition to the Int4 and Int8 GPTQ models provided before, AWQ and GGUF quantized models are now also offered (see the loading sketch after this list).
- Framework Integration: Collaborates with vLLM and SGLang (deployment), AutoAWQ and AutoGPTQ (quantization), Axolotl and LLaMA-Factory (fine-tuning), and llama.cpp (local LLM inference).
- Performance: Qwen1.5 shows strong performance across benchmarks at every model size. In particular, Qwen1.5-72B significantly outperforms Llama2-70B in language understanding, reasoning, and mathematical tasks, although a gap remains compared to GPT-4.
- Multilingual Capability: Optimized multilingual support improves performance on non-English tasks.
- Long Sequences: Enhanced long-context processing, supporting longer input sequences.
- Linking External Systems: Further optimized tool invocation, supporting efficient integration with external systems.
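As a concrete illustration of the quantized releases, here is a minimal sketch of loading an AWQ checkpoint with transformers. It assumes the repository id Qwen/Qwen1.5-7B-Chat-AWQ is available on the Hugging Face Hub and that the autoawq package is installed; a GPTQ repository id can be swapped in the same way:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat-AWQ"  # assumed AWQ repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# transformers reads the quantization_config stored in the checkpoint
# and loads the 4-bit AWQ weights automatically.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Tell me something about large language models."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))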
Qwen2 Series
The Qwen2 series has achieved significant improvements in model scale, training data, and multilingual support:
- Dataset: Pre-training data reaches 7T tokens, covering a wide range of domains and about 30 languages, with significant improvements in quality, scale, and diversity, and is optimized for long-context training. Post-training data is constructed through collaborative human annotation and automated synthesis, covering multiple domains and laying the foundation for performance on complex tasks.
- Diverse Scales: Offers five scales, 0.5B, 1.5B, 7B, 57B-A14B (MoE), and 72B, covering both dense and MoE models.
- Context Length: All pre-trained models support a context length of 32K tokens. With methods such as YaRN, Qwen2-7B-Instruct and Qwen2-72B-Instruct support context lengths of up to 128K tokens (see the configuration sketch after this list).
- Multilingual Capability: Adds training data in 27 additional languages and mitigates code-switching, reducing the probability of unintended language switching, for stronger multilingual processing.
- Code and Mathematical Ability: Outstanding performance on programming and mathematical tasks.
- Powerful Context Processing: Extended context length support, up to 128K tokens.
- Outstanding Performance: Leads on multiple evaluation benchmarks, surpassing models of comparable scale.
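The 128K support is enabled through RoPE scaling at inference time. The sketch below is a minimal illustration of adding the YaRN settings to a checkpoint's config.json; the local path is hypothetical and the exact rope_scaling keys follow the convention used by transformers/vLLM for Qwen2, so check the model card for the current recommendation:
import json

# Hypothetical local path to a downloaded Qwen2 checkpoint.
cfg_path = "./Qwen2-7B-Instruct/config.json"

with open(cfg_path) as f:
    cfg = json.load(f)

# YaRN-style RoPE scaling: 32K native context x factor 4.0 gives roughly 128K tokens.
cfg["rope_scaling"] = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)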
Qwen2.5 Series
The Qwen2.5 series is the latest generation of Qwen, further enhancing model performance and training data scale, and adding the programming-focused Qwen2.5-Coder and mathematics-focused Qwen2.5-Math models. All open-weight models are dense, decoder-only language models.
- Model Scale: Provides versions at multiple scales, including:
  - Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B;
  - Qwen2.5-Coder: 1.5B, 7B, and the upcoming 32B;
  - Qwen2.5-Math: 1.5B, 7B, and 72B.
- Training Data: Pre-training data has expanded to 18T tokens, significantly enhancing common sense, domain knowledge, and reasoning abilities. Post-training uses over one million SFT samples together with multi-stage reinforcement learning (offline DPO and online GRPO), improving human-preference alignment and instruction following.
- Context Length: Supports a context length of up to 128K tokens and can generate up to 8K tokens. The context length of Qwen2.5-Turbo has been extended from 128K to 1M tokens.
- Language Support: Supports over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, and Arabic.
- Open-source and Closed-source Models: Base and Instruct models at various scales are open-sourced, while the closed-source Qwen2.5-Turbo and Qwen2.5-Plus are offered through API services.
- Performance: The flagship Qwen2.5-72B-Instruct performs excellently in language understanding, reasoning, mathematics, and coding, surpassing many open-source and closed-source models and rivaling Llama-3.1-405B-Instruct.
Through continuous iteration and optimization, Qwen2.5, the latest open-source series, demonstrates powerful capabilities in language understanding, reasoning, code generation, and multimodal tasks, and has become an important driving force in the open-source LLM field.
Framework Ecosystem Supported by Qwen
The framework ecosystem supported by Qwen is very rich, covering fine-tuning, quantization, deployment, etc., as shown below:
Fine-tuning: Axolotl, LLaMA-Factory, Firefly, Swift, XTuner
Quantization: AutoGPTQ, AutoAWQ, Neural Compressor
Deployment: vLLM, SGLang, SkyPilot, TensorRT-LLM, OpenVINO, TGI
Local Running: MLX, llama.cpp, Ollama, LM Studio
Agent and RAG (Retrieval-Augmented Generation) frameworks: LlamaIndex, CrewAI, OpenDevin
Evaluation: LMSys, OpenCompass, Open LLM Leaderboard
Model Secondary Development: Dolphin, OpenBuddy
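As a small taste of the local-running side of this ecosystem, the following minimal sketch runs a GGUF build of Qwen2.5 through the llama-cpp-python bindings. The GGUF file name is a hypothetical placeholder; you would first download a quantized GGUF release of the model:
from llama_cpp import Llama

# Hypothetical path to a locally downloaded GGUF quantization of Qwen2.5-7B-Instruct.
llm = Llama(model_path="./qwen2.5-7b-instruct-q4_k_m.gguf", n_ctx=4096)

result = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Tell me something about large language models."}
    ],
    max_tokens=256,
    temperature=0.7,
)
print(result["choices"][0]["message"]["content"])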
Simple Applications of Qwen
Below are some simple examples demonstrating how to invoke and deploy Qwen.
Example of Invoking Qwen2.5 Model Based on OpenAI Library:
from openai import OpenAI
import os

# Create a client pointed at Alibaba Cloud's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.getenv("YOUR_API_KEY"),  # read the API key from an environment variable
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen-plus-latest",
    messages=[
        {"role": "user", "content": "Tell me something about large language models."}
    ],
)
print(completion.choices[0].message.content)
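If you prefer tokens to arrive incrementally rather than waiting for the full reply, the same endpoint also supports streaming; a minimal sketch, reusing the client created above:
# Stream the response chunk by chunk instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="qwen-plus-latest",
    messages=[
        {"role": "user", "content": "Tell me something about large language models."}
    ],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()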
Example of Using Qwen2.5 Based on Transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"

# Load the model weights and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Tell me something about large language models."
messages = [
    {"role": "user", "content": prompt}
]
# Render the chat messages with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated tokens remain.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Example of Offline Inference with Qwen Based on vLLM:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Pass the default decoding hyperparameters of Qwen2.5-7B-Instruct.
# max_tokens is the maximum length for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Input the model name or path. Can be a GPTQ or AWQ model.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Example of Deploying and Invoking Qwen2.5 Using vLLM
To run Qwen2.5 with vLLM and deploy a service compatible with the OpenAI API, you can run the following command:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct
Then you can chat with Qwen2.5 via curl:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "user", "content": "Tell me something about large language models."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
}'
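The same local server can also be called from Python with the OpenAI client by pointing it at the local endpoint; a minimal sketch, assuming the server was launched without an API key configured, so a placeholder value is enough:
from openai import OpenAI

# Point the OpenAI client at the locally deployed vLLM server.
client = OpenAI(
    api_key="EMPTY",  # placeholder; the local server is assumed to run without authentication
    base_url="http://localhost:8000/v1",
)

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me something about large language models."}
    ],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
)
print(completion.choices[0].message.content)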