01
Introduction
On August 3, Alibaba Cloud released its first open-source large model: Tongyi Qwen-7B, which is both open-source and commercially usable. Although expectations have already been raised by the various hundred-billion-parameter models, the fact that this one comes from Alibaba has attracted widespread attention and discussion among peers, and it has performed excellently on various large-model leaderboards.
The Firefly project has added support for training Tongyi Qwen-7B. We ran the training pipeline with over thirty thousand multi-turn dialogue examples, initially just to verify that the pipeline worked, but the results turned out to be very good; see Section 03 for details. One can only marvel at the power of 2.2 trillion pre-training tokens, which indeed works wonders.
In the future, we will fine-tune Qwen-7B with more data, around one million entries, to validate its effectiveness.
This article will mainly introduce some relevant information about Tongyi Qwen-7B and the results after our simple fine-tuning.
Firefly project link:
https://github.com/yangjianxin1/Firefly
Tongyi Qwen-7B project link:
https://github.com/QwenLM/Qwen-7B
02
Tongyi Qwen-7B
Tongyi Qwen-7B is a model developed by Alibaba Cloud with a scale of 7 billion parameters. Qwen-7B is a large language model based on the Transformer architecture, trained on a massive scale of pre-training data. The pre-training data types are diverse and extensive, including a large amount of web text, professional books, code, etc. Additionally, based on Qwen-7B, an AI assistant called Qwen-7B-Chat was created using alignment mechanisms.
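As a side note for readers who want to try the chat model directly, below is a minimal sketch of loading Qwen-7B-Chat through Hugging Face transformers; the model id Qwen/Qwen-7B-Chat and the chat() helper come from the model's remote code and should be verified against the Qwen-7B repository.

# Minimal sketch: chatting with Qwen-7B-Chat via transformers.
# Assumes the Hugging Face model id "Qwen/Qwen-7B-Chat" and the chat()
# helper exposed by the model's remote code; verify against the Qwen repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True
).eval()

# Multi-turn dialogue: pass the running history back in on each turn.
response, history = model.chat(tokenizer, "Hello, please introduce yourself.", history=None)
print(response)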
Tongyi Qwen-7B has the following main features:
- Large-scale high-quality training corpus: Pre-trained using over 2.2 trillion tokens of data, which includes high-quality Chinese, English, multilingual, code, mathematics, etc., covering general and specialized training materials. The distribution of the pre-training corpus has been optimized through numerous comparative experiments.
- Powerful performance: Qwen-7B significantly outperforms existing open-source models of similar scale in various Chinese and English downstream evaluation tasks (including common sense reasoning, code, mathematics, translation, etc.), and even shows strong competitiveness in some metrics compared to larger models. For specific evaluation results, please refer to the following.
- More comprehensive vocabulary coverage: Compared to current open-source models primarily using Chinese and English vocabularies, Qwen-7B utilizes a vocabulary size of about 150,000. This vocabulary is more friendly for multilingual use, allowing users to enhance and expand capabilities for certain languages without extending the vocabulary.
The basic specifications of the Qwen-7B model are as follows.
In terms of position encoding, FFN activation function, and normalization, Qwen-7B adopts the currently most popular methods: RoPE relative position encoding, the SwiGLU activation function, and RMSNorm (with optional installation of flash-attention for acceleration).
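To make these components concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block as they are commonly implemented; it illustrates the general techniques rather than Qwen-7B's actual code, and the dimensions are placeholder assumptions.

# Sketch of RMSNorm and a SwiGLU feed-forward block (generic implementations,
# not Qwen-7B's actual code; dimensions below are placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    # Root-mean-square norm: rescale by the RMS of the hidden vector, no mean subtraction.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFFN(nn.Module):
    # SwiGLU feed-forward: SiLU-gated linear unit followed by a down projection.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w3 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

# Toy usage with placeholder dimensions.
x = torch.randn(2, 16, 4096)
y = SwiGLUFFN(4096, 11008)(RMSNorm(4096)(x))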
Regarding the tokenizer, compared with mainstream open-source models that primarily use Chinese and English vocabularies, Qwen-7B uses a vocabulary of over 150,000 tokens. This vocabulary builds on the BPE vocabulary cl100k_base used by GPT-4 and is further optimized for Chinese and other languages, enabling efficient encoding and decoding of Chinese, English, and code, and making it friendlier to multilingual users, who can enhance the model's capabilities in their language without expanding the vocabulary. The vocabulary splits numbers into individual digits, and tokenization is performed with the efficient tiktoken library.
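As a hedged illustration of the tokenizer in practice (the Hugging Face model id Qwen/Qwen-7B and the loading pattern are assumptions to be checked against the official repository), one can load it via transformers and count the tokens produced for a mixed Chinese/English/code sample:

# Sketch: inspecting the Qwen-7B tokenizer (tiktoken-based, ~150k-token vocabulary).
# The model id "Qwen/Qwen-7B" and trust_remote_code usage are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
print(len(tokenizer))  # vocabulary size, roughly 150k entries

sample = "通义千问 Qwen-7B was trained on 2.2 trillion tokens; price = 12345"
ids = tokenizer.encode(sample)
print(len(ids))               # number of tokens for the mixed-language sample
print(tokenizer.decode(ids))  # round-trips back to the original text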
From a random sample of 1 million documents across various languages, the Qwen team compared the encoding compression rates of different models (with XLM-R, which supports 100 languages, as the baseline value of 1; lower is better); the specific results are shown in the figure.
It can be seen that while maintaining efficient encoding and decoding of Chinese, English, and code, Qwen-7B also achieves high compression rates for many widely used languages (Thai, Hebrew, Arabic, Korean, Vietnamese, Japanese, Turkish, Indonesian, Polish, Russian, Dutch, Portuguese, Italian, German, Spanish, French, etc.), giving the model strong scalability as well as high training and inference efficiency in these languages.
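For readers who want to reproduce this kind of measurement, the sketch below computes a compression rate as the ratio of token counts between a candidate tokenizer and an XLM-R baseline on the same texts; the model ids and the tiny sample corpus are placeholders, not the Qwen team's actual 1-million-document setup.

# Sketch of a compression-rate comparison against an XLM-R baseline
# (ratio < 1 means fewer tokens than XLM-R for the same text).
# Model ids and sample texts are placeholders.
from transformers import AutoTokenizer

baseline = AutoTokenizer.from_pretrained("xlm-roberta-base")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

docs = [
    "สวัสดีครับ ยินดีต้อนรับ",              # Thai
    "مرحبا بكم في عالم النماذج اللغوية",   # Arabic
    "大規模言語モデルの事前学習",             # Japanese
]

def total_tokens(tok, texts):
    # Count tokens without special tokens so the comparison is fair.
    return sum(len(tok.encode(t, add_special_tokens=False)) for t in texts)

ratio = total_tokens(qwen, docs) / total_tokens(baseline, docs)
print(f"compression rate vs XLM-R: {ratio:.3f}")  # lower is better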
In terms of pre-training data, the Qwen-7B model utilizes some open-source general corpora, while also accumulating a massive amount of web data and high-quality text content, with deduplication and filtering resulting in over 2.2 trillion tokens. This encompasses web text, encyclopedias, books, code, mathematics, and various specialized fields.
Qwen-7B has surpassed the performance of language models of similar scale in multiple comprehensive evaluations of natural language understanding and generation, mathematical problem-solving, code generation, and other capabilities, including MMLU, C-Eval, GSM8K, HumanEval, WMT22, etc., even exceeding larger language models with 12-13 billion parameters.
03
Fine-Tuning Qwen-7B
In the Firefly project, Qwen-7B can be fine-tuned with QLoRA by launching the training script with the corresponding configuration file:
torchrun --nproc_per_node={num_gpus} train_qlora.py --train_args_file train_args/qlora/qwen-7b-sft-qlora.json
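For intuition about what QLoRA training involves under the hood, here is a generic sketch using transformers, bitsandbytes, and peft; it is not Firefly's actual implementation, and the LoRA target-module names are assumptions based on Qwen's layer naming that should be checked against the model.

# Generic QLoRA setup sketch: 4-bit quantized base model + LoRA adapters.
# Not Firefly's actual code; target_modules names are assumptions for Qwen-7B.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4 bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B", quantization_config=bnb_config,
    device_map="auto", trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn", "c_proj", "w1", "w2"],  # assumed Qwen module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
# The adapted model can then be trained with a standard transformers Trainer loop.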
For more generation examples, please open the link below to view the shared document:
https://docs.qq.com/sheet/DU0V0blJITXp2WEJq?tab=c5vlid
04
Conclusion
This article mainly introduced Tongyi Qwen-7B and shared the results of fine-tuning Qwen-7B with Firefly. Overall, Tongyi Qwen-7B is an excellent Chinese foundation model; its 2.2 trillion pre-training tokens make it the model with the largest amount of pre-training data to date.
Since Llama and Falcon, large models have stopped focusing solely on parameter counts and have started stacking more pre-training data: Falcon's 1 trillion tokens, Baichuan-13B's 1.4 trillion tokens, Llama2's 2 trillion tokens, and Qwen-7B's 2.2 trillion tokens all indicate that the main barrier to improving large models is no longer compute, but data.
The last major open-source release of a Chinese foundational large model dates back to early July with Baichuan-13B. This time, Alibaba's open-source release has reenergized the Chinese open-source community. Finally, we look forward to a hundred-billion-parameter Qwen model, as 2.2 trillion tokens are already sufficient to train a hundred-billion-parameter model adequately. Llama2 raised the ceiling for open-source models with 2 trillion tokens; we hope a hundred-billion-parameter Qwen can likewise raise the ceiling for Chinese open-source models.
About AINLP
AINLP is an interesting AI natural language processing community focused on sharing technologies related to AI, NLP, machine learning, deep learning, recommendation algorithms, etc. Topics include LLM, pre-training models, automatic generation, text summarization, intelligent Q&A, chatbots, machine translation, knowledge graphs, recommendation systems, computational advertising, recruitment information, and job experience sharing. Welcome to follow! To join the technical exchange group, please add the AINLP assistant WeChat (id: ainlp2), noting your work/research direction + purpose for joining the group.