In recent months, the Tongyi Qianwen team has been working hard to explore how to build a ‘good’ model while optimizing the developer experience. Just before the Chinese New Year, the team shared the next version of the Qwen open-source series: Qwen 1.5.
Qwen 1.5 open-sources base and chat models in six sizes: 0.5B, 1.8B, 4B, 7B, 14B, and 72B, along with quantized versions: not only Int4 and Int8 GPTQ models, but also AWQ and GGUF quantized models. To improve the developer experience, the Qwen 1.5 code has been merged into Hugging Face Transformers, so developers can use transformers>=4.37.0 directly without trust_remote_code. In addition, Qwen 1.5 is supported by frameworks such as vLLM, SGLang, and AutoGPTQ.
Compared with previous versions, Qwen 1.5 significantly improves the alignment of the chat models with human preferences and strengthens their multilingual capabilities. All models support a unified 32K context length, and the quality of the base language models has also improved slightly.
1. A More Comprehensive Model Series: Six model sizes plus GPTQ/AWQ/GGUF quantized versions, so there is one to fit every need.
2. Better Ecosystem Integration: Integrated with Hugging Face Transformers and mainstream third-party frameworks for deployment, quantization, fine-tuning, and serving, making it convenient to use.
3. More Powerful Performance: Chat model performance has improved significantly, with the Qwen 1.5-Chat series achieving excellent results even on the English MT-Bench.
4. More Comprehensive and Unified Features: The entire series uniformly supports a maximum context length of at least 32K, with across-the-board improvements in multilingual capabilities and richer multilingual evaluations. The entire series also uniformly supports system prompts and offers strong capabilities for connecting to external systems (agent/RAG/tool use/code interpreter).
Experience Address:
https://modelscope.cn/studios/qwen/Qwen1.5-72B-Chat-Demo/summary
For example, multilingual capabilities:
Role-playing:
Tool invocation capabilities:
Model Link: https://modelscope.cn/organization/qwen
from modelscope import snapshot_download
model_dir = snapshot_download('qwen/Qwen1.5-7B-Chat')
Environment Dependencies:
pip install "transformers>=4.37.0"
Inference Code:
from modelscope import AutoModelForCausalLM, AutoTokenizer
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen1.5-7B-Chat",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen1.5-7B-Chat")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
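The GPTQ and AWQ quantized checkpoints load through the same interface. The following is a minimal sketch, assuming the Int4 GPTQ checkpoint is published on ModelScope under the id qwen/Qwen1.5-7B-Chat-GPTQ-Int4 (check the model page for the exact name) and that the auto-gptq and optimum packages are installed:
from modelscope import AutoModelForCausalLM, AutoTokenizer

# Assumed ModelScope id of the Int4 GPTQ checkpoint; verify it on the model page
quantized_id = "qwen/Qwen1.5-7B-Chat-GPTQ-Int4"

# The quantization config is read from the checkpoint, so loading looks the same
# as for the fp16 model; generation then works exactly as in the snippet above
model = AutoModelForCausalLM.from_pretrained(quantized_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_id)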
The SWIFT fine-tuning framework from the ModelScope community (https://github.com/modelscope/swift) already supports fine-tuning and inference for the entire Qwen 1.5 model series.
Below is an example training parameter configuration for the self-cognition task, using the Qwen1.5-7B-Chat model:
# Experimental environment: A100
# 30GB GPU memory
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_sft.py \
    --model_type qwen1half-7b-chat \
    --sft_type lora \
    --tuner_backend swift \
    --dtype AUTO \
    --output_dir output \
    --dataset ms-bench \
    --train_dataset_sample 5000 \
    --num_train_epochs 2 \
    --max_length 1024 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn false \
    --self_cognition_sample 1000 \
    --model_name 卡卡罗特 \
    --model_author 陶白白
The ms-bench dataset is a general-knowledge dataset provided by ModelScope, mixed into training to prevent knowledge forgetting. The training loss converged as follows:
It can be seen that the convergence is very smooth.
Training memory usage:
For inference after training, the following script can be used (note: replace --ckpt_dir with the path to the weights in the training output directory):
# Experimental environment: A100
PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_infer.py \
    --ckpt_dir "/xxx/xxx/Qwen1.5-7b-chat/vx-xxx/checkpoint-xx" \
    --load_dataset_config true \
    --max_length 2048 \
    --eval_human true \
    --use_flash_attn false \
    --max_new_tokens 2048 \
    --temperature 0.1 \
    --top_p 0.7 \
    --repetition_penalty 1. \
    --do_sample true \
    --merge_lora_and_save false
Inference results of the model after self-cognition fine-tuning:
Using vLLM to Deploy the Open-Source Qwen 1.5 Models from the ModelScope Community
Set environment variable: export VLLM_USE_MODELSCOPE=True
Launch the OpenAI-compatible API server with vLLM:
python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen1.5-7B-Chat --max-model-len 8192 --gpu-memory-utilization 0.95
Access the service
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen/Qwen1.5-7B-Chat",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write an essay on the theme of spring."}
        ],
        "stop": ["<|im_end|>", "<|endoftext|>"]
    }'
Using llama.cpp to Deploy the GGUF Version of Qwen 1.5
Download GGUF file:
from modelscope.hub.file_download import model_file_download
model_dir = model_file_download(
    model_id='qwen/Qwen1.5-1.8B-Chat-GGUF',
    file_path='qwen1.5-1_8b-chat-q8_0.gguf',
    revision='master',
    cache_dir='/mnt/workspace/'
)
Clone the llama.cpp repository and run inference:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j && ./main -m /mnt/workspace/qwen/Qwen1.5-1.8B-Chat-GGUF/qwen1.5-1_8b-chat-q8_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Running Qwen 1.5 using Ollama
Install Ollama and start the service:
curl https://ollama.ai/install.sh | sh
ollama serve
Run Qwen directly
ollama run qwen
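Once ollama serve is running, the model can also be queried through Ollama's local REST API (port 11434 by default). A minimal sketch:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen",
  "prompt": "Give me a short introduction to large language models.",
  "stream": false
}'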
Inference using the llamafile version without installation (thanks to community user bingal for the contribution):
Link: https://modelscope.cn/models/bingal/Qwen1.5-7B-Chat-llamafile/summary
Model download:
from modelscope.hub.file_download import model_file_download
model_dir = model_file_download(
    model_id='bingal/Qwen1.5-7B-Chat-llamafile',
    file_path='qwen1.5-7b-chat-q5_k_m.llamafile',
    revision='master',
    cache_dir='/mnt/workspace/'
)
Run inference directly without environment installation:
chmod +x qwen1.5-7b-chat-q5_k_m.llamafile
./qwen1.5-7b-chat-q5_k_m.llamafile
Supports OpenAI format API calls:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Hello"}
    ]
)
print(completion.choices[0].message)