Qwen 1.5 Open Source! ModelScope Best Practices!

In recent months, the Tongyi Qianwen (Qwen) team has been exploring how to build a 'good' model while improving the developer experience. Just before the Chinese New Year, the team released Qwen 1.5, the next version of the Qwen open-source series.


Qwen 1.5 open-sources base and chat models in six sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B, along with quantized versions, including Int4 and Int8 GPTQ models as well as AWQ and GGUF quantizations. To improve the developer experience, the Qwen 1.5 code has been merged into Hugging Face Transformers, so developers can use transformers>=4.37.0 directly without trust_remote_code. Qwen 1.5 is also supported by frameworks such as vLLM, SGLang, and AutoGPTQ.
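For example, with transformers>=4.37.0 the model loads through the standard Auto classes; here is a minimal sketch, assuming the Hugging Face Hub id Qwen/Qwen1.5-7B-Chat (the ModelScope download path is shown later in this post):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen2 support is built into transformers>=4.37.0,
# so trust_remote_code=True is no longer needed.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B-Chat", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")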

Compared to previous versions, Qwen 1.5 significantly improves the chat models' alignment with human preferences and strengthens their multilingual capabilities. All models now uniformly support a 32K context length, and the quality of the base language models has also improved modestly.

00
Key Points from the Editor

1. A More Comprehensive Model Series: six model sizes are provided, along with GPTQ/AWQ/GGUF quantized versions, so there is one to suit every need.

2. Better Ecosystem Integration: integrates with Hugging Face Transformers and mainstream third-party frameworks for deployment, quantization, fine-tuning, and serving, making it convenient to use.

3. More Powerful Performance: chat model performance has improved significantly; the Qwen 1.5-Chat series achieves excellent results even on the English MT-Bench.

4. More Comprehensive and Unified Features: the entire series uniformly supports a maximum context length of at least 32K tokens (a quick config-based check is sketched just after this list), offers comprehensively improved multilingual capabilities with richer multilingual evaluations, uniformly supports system prompts, and has strong capabilities for linking to external systems (agent, RAG, tool use, code interpreter).
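As a quick sanity check of the unified 32K claim, the advertised maximum can be read from the published model config; a minimal sketch (32768 is the value we would expect from the claim above, not something verified here):

from transformers import AutoConfig

# Read the maximum context length from the model config;
# per the 32K claim above this should print 32768.
cfg = AutoConfig.from_pretrained("Qwen/Qwen1.5-7B-Chat")
print(cfg.max_position_embeddings)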

01
ModelScope Best Practices
Model Experience

Demo address:

https://modelscope.cn/studios/qwen/Qwen1.5-72B-Chat-Demo/summary

For example, multilingual capabilities:

[Screenshot: multilingual chat demo]

Role-playing:

[Screenshot: role-playing demo]

Tool invocation capabilities:

[Screenshot: tool invocation demo]

Model Download

Model Link: https://modelscope.cn/organization/qwen

from modelscope import snapshot_download

model_dir = snapshot_download('qwen/Qwen1.5-7B-Chat')
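If you need to pin a model revision or control where the weights are stored, snapshot_download also accepts revision and cache_dir arguments; a sketch, with placeholder values:

from modelscope import snapshot_download

# Pin a revision and choose a local cache directory (both values are examples).
model_dir = snapshot_download('qwen/Qwen1.5-7B-Chat',
                              revision='master',
                              cache_dir='/mnt/workspace/')
print(model_dir)  # local directory containing the downloaded weights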
Model Inference

Environment Dependencies:

pip install "transformers>=4.37.0"

Inference Code:

from modelscope import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen1.5-7B-Chat",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen1.5-7B-Chat")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
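For interactive use, generated tokens can also be streamed to stdout as they are produced; a minimal sketch using the TextStreamer helper from Transformers, reusing the model, tokenizer, and model_inputs defined above:

from transformers import TextStreamer

# Print tokens as they are generated, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(model_inputs.input_ids, max_new_tokens=512, streamer=streamer)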
Model Training

The SWIFT fine-tuning framework from the ModelScope community (https://github.com/modelscope/swift) supports fine-tuning and inference for the entire Qwen 1.5 model series.

Below is an example training parameter configuration for a self-cognition task, using the Qwen 1.5-7B-Chat model:

# Experimental environment: A100
# 30GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_sft.py \
    --model_type qwen1half-7b-chat \
    --sft_type lora \
    --tuner_backend swift \
    --dtype AUTO \
    --output_dir output \
    --dataset ms-bench \
    --train_dataset_sample 5000 \
    --num_train_epochs 2 \
    --max_length 1024 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn false \
    --self_cognition_sample 1000 \
    --model_name 卡卡罗特 \
    --model_author 陶白白

The ms-bench dataset is a general-knowledge dataset provided by ModelScope; it is mixed into the training data to prevent knowledge forgetting. The training loss converged as follows:

[Figure: training loss curve]

As the curve shows, the loss converges smoothly.

Training memory usage:

[Figure: GPU memory usage during training]

After training, inference can be run with the following script (replace --ckpt_dir with the checkpoint path written to the training output directory):

# Experimental environment: A100
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_infer.py \
    --ckpt_dir "/xxx/xxx/Qwen1.5-7b-chat/vx-xxx/checkpoint-xx" \
    --load_dataset_config true \
    --max_length 2048 \
    --eval_human true \
    --use_flash_attn false \
    --max_new_tokens 2048 \
    --temperature 0.1 \
    --top_p 0.7 \
    --repetition_penalty 1.0 \
    --do_sample true \
    --merge_lora_and_save false

Inference results after self-cognition fine-tuning:

[Screenshot: self-cognition inference results]

Model Deployment

Using vLLM to Deploy the ModelScope Version of Qwen 1.5

Set the environment variable: export VLLM_USE_MODELSCOPE=True

Launch the OpenAI-compatible API server with vLLM:

python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-7B-Chat --max-model-len 8192 --gpu-memory-utilization 0.95

Query the service:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen/Qwen1.5-7B-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write an essay on the theme of spring."}
        ],
        "stop": ["<|im_end|>", "<|endoftext|>"]
    }'
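Because the server is OpenAI-compatible, the same request can be issued from Python with the openai client; a sketch (the api_key is a dummy value, since the local server does not check it):

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
completion = client.chat.completions.create(
    model="qwen/Qwen1.5-7B-Chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write an essay on the theme of spring."}
    ]
)
print(completion.choices[0].message.content)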

Using llama.cpp to Deploy the GGUF Version of Qwen 1.5

Download the GGUF file:

from modelscope.hub.file_download import model_file_download

model_dir = model_file_download(model_id='qwen/Qwen1.5-1.8B-Chat-GGUF',
                                file_path='qwen1.5-1_8b-chat-q8_0.gguf',
                                revision='master',
                                cache_dir='/mnt/workspace/')

Clone the llama.cpp repository, build it, and run inference:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j && ./main -m /mnt/workspace/qwen/Qwen1.5-1.8B-Chat-GGUF/qwen1.5-1_8b-chat-q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e


Running Qwen 1.5 using Ollama

Install Ollama and start the service:

curl https://ollama.ai/install.sh | sh
ollama serve

Run Qwen directly:

ollama run qwen
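Besides the interactive CLI, Ollama also exposes a local REST API (on port 11434 by default); a minimal sketch against the /api/generate endpoint:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen",
  "prompt": "Why is the sky blue?"
}'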

You can also run inference using the llamafile version without any installation (thanks to community user bingal for the contribution):

Link: https://modelscope.cn/models/bingal/Qwen1.5-7B-Chat-llamafile/summary

Model download:

from modelscope.hub.file_download import model_file_download

model_dir = model_file_download(model_id='bingal/Qwen1.5-7B-Chat-llamafile',
                                file_path='qwen1.5-7b-chat-q5_k_m.llamafile',
                                revision='master',
                                cache_dir='/mnt/workspace/')

Run inference directly without environment installation:

chmod +x qwen1.5-7b-chat-q5_k_m.llamafile
./qwen1.5-7b-chat-q5_k_m.llamafile

The llamafile server also supports OpenAI-format API calls:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Hello"}
    ]
)
print(completion.choices[0].message)


