Qwen2.5-1M: Open Source Model Supporting 1 Million Tokens Context

01
Introduction

Two months ago, the Qwen team upgraded Qwen2.5-Turbo to support a context length of up to one million tokens. Today, Qwen officially launched the open-source Qwen2.5-1M model along with its corresponding inference framework support. Here are the highlights of this release:

Open Source Models: This release includes two new open-source models, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, marking the first time an open-source Qwen model's context has been extended to 1M tokens.

Inference Framework: To help developers deploy the Qwen2.5-1M series models more efficiently, the Qwen team has fully open-sourced an inference framework based on vLLM, integrating sparse attention methods, which enhances the framework’s speed by 3 to 7 times when processing 1M token inputs.

Technical Report: The Qwen team also shared the technical details behind the Qwen2.5-1M series, including the design concepts of the training and inference frameworks as well as the results of ablation experiments.

Model Links:

https://www.modelscope.cn/collections/Qwen25-1M-d6cf9fd33f0a40

Technical Report:

https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-1M/Qwen2_5_1M_Technical_Report.pdf

Experience Link:

https://modelscope.cn/studios/Qwen/Qwen2.5-1M-Demo

02
Model Performance

First, let’s take a look at the performance of the Qwen2.5-1M series models in long context tasks and short text tasks.

Long Context Tasks

In the Passkey Retrieval task with a context length of one million tokens, the Qwen2.5-1M series models can accurately retrieve hidden information from a 1M length document, with only the 7B model showing a few errors.

For more complex long context understanding tasks, the RULER, LV-Eval, and LongbenchChat test sets were selected.

From these results, the following key conclusions can be drawn:

  • Significantly Outperforming the 128K Version: The Qwen2.5-1M series models significantly outperform the previous 128K version in most long context tasks, especially excelling in tasks that exceed 64K length.

  • Clear Performance Advantage: The Qwen2.5-14B-Instruct-1M model not only outperforms Qwen2.5-Turbo but also consistently surpasses GPT-4o-mini across multiple datasets, providing a solid open-source option for long context tasks.

Short Sequence Tasks

In addition to the performance in long sequence tasks, the model’s performance in short sequences is equally important. The Qwen2.5-1M series models were compared against the previous 128K version and GPT-4o-mini in widely used academic benchmark tests.

It can be seen that:

  • Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M perform comparably to their 128K versions in short text tasks, ensuring that basic abilities are not compromised by the added long sequence processing capability.

  • Compared to GPT-4o-mini, Qwen2.5-14B-Instruct-1M and Qwen2.5-Turbo achieve similar performance in short text tasks, while having a context length eight times that of GPT-4o-mini.

03
Key Technologies

Long Context Training

Training long sequences requires substantial computational resources, so a stepwise length-expansion approach was adopted, progressively extending the context length of Qwen2.5-1M from 4K to 256K across multiple stages:

  • Starting point: an intermediate checkpoint of pre-trained Qwen2.5 with a context length of 4K.

  • Pre-training: the context length was gradually increased from 4K to 256K, while the Adjusted Base Frequency scheme raised the RoPE base frequency from 10,000 to 10,000,000 (a small sketch after this list illustrates the effect of the larger base).

  • Supervised fine-tuning: conducted in two stages to maintain performance on short sequences:

    • First stage: fine-tuning only on short instructions (up to 32K tokens), using the same data and number of steps as the 128K version of Qwen2.5.

    • Second stage: a mix of short instructions (up to 32K) and long instructions (up to 256K) to enhance performance on long tasks while maintaining quality on short ones.

  • Reinforcement learning: the model was trained on short texts (up to 8K tokens). Even when trained only on short texts, the improvements in human preference alignment generalized well to long context tasks.

Through this training, we obtained an instruction-tuned model with a context length of 256K tokens.
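
To see why raising the RoPE base frequency matters for long contexts, the snippet below compares the longest rotary wavelength under both bases. This is only an illustrative sketch: the 10,000 and 10,000,000 values come from the text above, while the head dimension of 128 is an assumption for illustration. With the larger base, the slowest-rotating component takes tens of millions of tokens to complete a full rotation, so very distant positions remain distinguishable.

import math

def rope_inv_freq(base: float, head_dim: int = 128):
    """Inverse frequencies used by rotary position embedding (RoPE)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Compare the slowest-rotating component under the two base frequencies.
for base in (10_000, 10_000_000):
    slowest = rope_inv_freq(base)[-1]        # smallest angular frequency
    wavelength = 2 * math.pi / slowest       # tokens per full rotation
    print(f"base={base:>10,}: longest wavelength ~ {wavelength:,.0f} tokens")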

Length Extrapolation

During the above training process, the model’s context length was only 256K tokens. To extend it to 1M tokens, a length extrapolation technique was employed.

Large language models based on rotary position encoding suffer performance degradation on long context tasks mainly because, when computing attention weights, the relative positional distances between queries and keys exceed anything seen during training. To address this, Qwen2.5-1M adopts the Dual Chunk Attention (DCA) method, which remaps overly large relative positions to smaller values that fall within the trained range.
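
The core idea can be shown with a toy snippet. This is not the actual DCA algorithm (which distinguishes intra-chunk, inter-chunk, and successive-chunk attention); it only illustrates the remapping intuition, with the 256K trained length taken from the training section above.

TRAINED_LEN = 256 * 1024  # longest context length seen during training

def remapped_rel_pos(query_pos: int, key_pos: int, max_rel: int = TRAINED_LEN - 1) -> int:
    """Relative distance fed into the position encoding, kept inside the trained range."""
    rel = query_pos - key_pos
    return min(rel, max_rel)  # distances beyond training are mapped back into range

# A query at position 1,000,000 attending to an early key would otherwise see a
# relative distance of ~1M tokens, far outside anything seen in training.
print(remapped_rel_pos(1_000_000, 10))  # 262143 instead of 999990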

Evaluations were conducted on the Qwen2.5-1M model and the previous 128K version, testing both with and without the length extrapolation method.

Results indicated that even a model trained only on 32K tokens, such as Qwen2.5-7B-Instruct, can achieve near-perfect accuracy in the Passkey Retrieval task when processing 1M token contexts. This demonstrates the powerful capability of DCA to significantly extend the supported context length without requiring additional training.

Sparse Attention Mechanism

For long context language models, inference speed is crucial for user experience. To accelerate the pre-fill phase, the research team introduced a sparse attention mechanism based on MInference. Additionally, several improvements were proposed:

  • Chunked Prefill: Processing a one-million-token sequence in a single pass would incur massive memory overhead from the activations of the MLP layers; for Qwen2.5-7B, this overhead can reach 71GB. By adapting chunked prefill to work with sparse attention, input sequences are split into chunks of 32,768 tokens and prefilled chunk by chunk, reducing the activation memory of the MLP layers by 96.7% and significantly lowering device memory requirements (see the sketch after this list).

  • Integration of Length Extrapolation Scheme: We further integrated the DCA-based length extrapolation scheme into the sparse attention mechanism, allowing our inference framework to enjoy both higher inference efficiency and accuracy in long sequence tasks.

  • Sparsity Optimization: The original MInference method required offline searching to determine the optimal sparsity configuration for each attention head. Due to the memory demands of full attention weights, this search is typically conducted on short sequences and may not yield good results on longer sequences. We proposed a method capable of optimizing the sparsity configuration on sequences of one million length, significantly reducing the accuracy loss caused by sparse attention.

  • Other Optimizations: We also introduced other optimization measures, such as optimizing operator efficiency and dynamic chunked pipeline parallelism, to fully leverage the potential of the entire framework.
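
A minimal, framework-agnostic sketch of the chunk-by-chunk prefill described in the first bullet; model_forward and kv_cache are hypothetical stand-ins for illustration, not the actual vLLM API:

from typing import List

CHUNK_SIZE = 32_768  # prefill chunk length used by the framework

def chunked_prefill(token_ids: List[int], model_forward, kv_cache):
    """Prefill a long prompt chunk by chunk.

    model_forward(chunk, kv_cache) is assumed to run the model on `chunk` while
    reading and extending `kv_cache`; activations only ever exist for one chunk.
    """
    logits = None
    for start in range(0, len(token_ids), CHUNK_SIZE):
        chunk = token_ids[start:start + CHUNK_SIZE]
        logits = model_forward(chunk, kv_cache)  # kv_cache grows by len(chunk)
    return logits  # logits of the last chunk drive the first generated token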

Through these improvements, the inference framework has increased the pre-fill speed for sequences of 1M token length across different model sizes and GPU devices by 3.2 to 6.7 times.

04
Model Deployment

System Preparation

To achieve optimal performance, it is recommended to use GPUs with the Ampere or Hopper architecture, which support the optimized kernels.

Please ensure the following requirements are met:

  • CUDA Version: 12.1 or 12.3

  • Python Version: >=3.9 and <=3.12

Memory requirements for processing sequences of 1M length:

  • Qwen2.5-7B-Instruct-1M: At least 120GB of GPU memory (total across GPUs).

  • Qwen2.5-14B-Instruct-1M: At least 320GB of GPU memory (total across GPUs).

If your GPU memory does not meet these requirements, you can still use Qwen2.5-1M for tasks with shorter contexts. A rough sense of where the memory goes is sketched below.
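
As a back-of-the-envelope check, the KV cache alone for a 1M-token context already accounts for a large share of the budget. The sketch below assumes the commonly published Qwen2.5-7B configuration (28 layers, 4 KV heads via GQA, head dimension 128) and a bfloat16 cache; the deployed configuration may differ, and model weights plus activations come on top of this.

# Rough KV-cache size for a 1M-token context (assumed Qwen2.5-7B configuration).
layers, kv_heads, head_dim = 28, 4, 128   # assumption: published 7B config
seq_len, bytes_per_value = 1_000_000, 2   # bfloat16 cache

kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value  # K and V
print(f"KV cache ~ {kv_cache_bytes / 1024**3:.1f} GiB")  # roughly 53 GiB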

Install Dependencies

For now, you need to clone the vLLM repository from the custom branch and install it manually. The research team is working on submitting this branch to the vLLM project.

git clone -b dev/dual-chunk-attn git@github.com:QwenLM/vllm.git
cd vllm
pip install -e . -v

Start OpenAI Compatible API Service

Specify that the model should be downloaded from ModelScope:

export VLLM_USE_MODELSCOPE=True

Launch the OpenAI-compatible API service:

vllm serve Qwen/Qwen2.5-7B-Instruct-1M \
  --tensor-parallel-size 4 \
  --max-model-len 1010000 \
  --enable-chunked-prefill --max-num-batched-tokens 131072 \
  --enforce-eager \
  --max-num-seqs 1

Parameter explanations:

  • --tensor-parallel-size

    • Set to the number of GPUs you are using. The 7B model supports up to 4 GPUs, while the 14B model supports up to 8 GPUs.

  • --max-model-len

    • Defines the maximum input sequence length. If you encounter memory issues, please reduce this value.

  • --max-num-batched-tokens

    • Sets the chunk size for Chunked Prefill. Smaller values can reduce activation memory usage but may slow down inference speed.

    • The recommended value is 131072 for optimal performance.

  • --max-num-seqs

    • Limits the number of sequences processed concurrently.

Interact with the Model

You can interact with the deployed model using the following methods:

Option 1. Using Curl

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct-1M",
    "messages": [
      {"role": "user", "content": "Tell me something about large language models."}
    ],
    "temperature": 0.7,
    "top_p": 0.8,
    "repetition_penalty": 1.05,
    "max_tokens": 512
  }'

Option 2. Using Python

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = (
    "There is an important info hidden inside a lot of irrelevant text. "
    "Find it and memorize them. I will quiz you about the important information there.\n"
    "The pass key is 28884. Remember it. 28884 is the pass key.\n"
    + "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. " * 800
    + "\nWhat is the pass key?"
)
# The prompt is about 20K tokens long. You can try a longer prompt by replacing 800 with 40000.

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct-1M",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={"repetition_penalty": 1.05},
)
print("Chat response:", chat_response)

You can also explore other frameworks, such as Qwen-Agent, to enable the model to read PDF files and more.

05
Direct Invocation Using ModelScope API-Inference

The ModelScope platform's API-Inference service also provides real-time support for the Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M models, so ModelScope users can call them directly through the API. Specific usage instructions can be found on the model page (e.g., https://modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct-1M).

Or refer to the API-Inference documentation: https://www.modelscope.cn/docs/model-service/API-Inference/intro
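
As a rough sketch of what such a call might look like with the OpenAI-compatible client (the base URL and token handling here are assumptions; please confirm the exact values in the documentation linked above):

from openai import OpenAI

client = OpenAI(
    api_key="<your-modelscope-sdk-token>",               # hypothetical placeholder
    base_url="https://api-inference.modelscope.cn/v1/",  # assumed endpoint, see docs
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-1M",
    messages=[{"role": "user", "content": "Summarize the key ideas behind Qwen2.5-1M."}],
)
print(response.choices[0].message.content)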

Thanks to Alibaba Cloud's Bailian platform for providing the underlying compute support.

Using Ollama and Llamafile

To facilitate local use, ModelScope has also provided GGUF and llamafile versions of the Qwen2.5-7B-Instruct-1M model. They can be run through the Ollama framework or directly as a llamafile.

1. Ollama Invocation

First, start the Ollama service:

ollama serve

Then run the GGUF model hosted on ModelScope with the ollama run command:

ollama run modelscope.cn/modelscope/Qwen2.5-7B-Instruct-1M-GGUF

Running result:

2. Direct Invocation of Llamafile Model

Llamafile packages the model and its runtime environment into a single executable file. Through the integration of the ModelScope command line with llamafile, the model can truly be run with one click across operating systems such as Linux, macOS, and Windows:

modelscope llamafile --model Qwen-Llamafile/Qwen2.5-7B-Instruct-1M-llamafile

Running result:

More documentation can be found at https://www.modelscope.cn/docs/models/advanced-usage/llamafile

06
Model Fine-tuning

Here we introduce how to fine-tune Qwen/Qwen2.5-7B-Instruct-1M using ms-swift.

Before starting fine-tuning, please ensure your environment is correctly set up:

# Install ms-swift
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

We provide a runnable fine-tuning demo and the format of the custom dataset. The fine-tuning script is as follows:

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct-1M \
    --train_type lora \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot

Training memory usage:

Custom dataset format (specify it directly with `--dataset`):

{"messages": [{"role": "user", "content": "&lt;query&gt;"}, {"role": "assistant", "content": "&lt;response&gt;"}, {"role": "user", "content": "&lt;query2&gt;"}, {"role": "assistant", "content": "&lt;response2&gt;"}]}

Inference script:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --max_new_tokens 2048

Push the model to ModelScope:

CUDA_VISIBLE_DEVICES=0 swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '<your-model-id>' \
    --hub_token '<your-sdk-token>'

07
What’s Next?

Although the Qwen2.5-1M series provides an excellent open-source option for long sequence processing tasks, we recognize that there is still significant room for improvement in long context models. Our goal is to build models that excel at both long and short tasks and truly perform in real-world application scenarios. To this end, we are exploring more efficient training methods, model architectures, and inference techniques, aiming to deploy these models effectively even in resource-constrained environments while achieving optimal performance. We firmly believe these efforts will unlock new possibilities for long context models, significantly expand their application scope, and continue to push the boundaries of the field. Stay tuned!
