About LLaMA Factory
In today’s rapidly advancing field of artificial intelligence, how to efficiently fine-tune and deploy large language models (LLMs) has become a hot topic in research and application. LLaMA Factory, as an open-source fine-tuning framework, has emerged in this context. It aims to provide developers with a simple and efficient tool to quickly adapt to specific task requirements based on existing pre-trained models, enhancing model performance.
LLaMA Factory supports various popular language models, such as LLaMA, BLOOM, Mistral, and Baichuan, covering a wide range of application scenarios. From academic research to enterprise applications, LLaMA Factory demonstrates its powerful adaptability and flexibility. Additionally, LLaMA Factory comes equipped with a user-friendly LlamaBoard web interface, lowering the usage barrier, allowing even users without a strong programming background to easily perform model fine-tuning and inference operations.
The emergence of LLaMA Factory not only saves developers a significant amount of time and resources but also promotes the popularization and application of AI technology. Through it, more people can participate in the customization and optimization of AI models, driving innovation and development across the industry.
This article will explore the LLaMA Factory framework and deploy the fine-tuned model using Ollama.
Environment Setup
Training large models requires a GPU. For learning purposes, I chose not to use the Alibaba Cloud PAI platform, but instead directly created an ECS instance to manually replicate the entire process.
I plan to train a 7B model, and after comparison, I selected a machine with the following configuration:
CPU: 8 vCPUs
GPU: 1
GPU Type: NVIDIA A10
GPU Memory: 24 GB
Memory: 30 GB
Bandwidth: 16.00 Gbps
System Disk Size: 500 GB

We chose Ubuntu as the operating system, which comes with the GPU driver pre-installed.

We need to wait a moment; we can only connect to the ECS after the GPU driver is installed.
By entering nvidia-smi, we can see our graphics card.

Refer to the official installation steps:
https://github.com/hiyouga/LLaMA-Factory/blob/main/README_zh.md
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
After installation, enter the following command to open the visual interface:
llamafactory-cli webui
Fine-tuning requires data. Here I found a public dataset from an Alibaba Cloud tutorial:
https://help.aliyun.com/zh/pai/use-cases/fine-tune-a-llama-3-model-with-llama-factory
cd LLaMA-Factory
wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/llama_factory/data.zip
mv data rawdata && unzip data.zip -d data
The content mainly consists of role-playing and multi-turn dialogues:
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "Imitate the tone of Sun Wukong from Journey to the West and chat with me. What do you get if you freeze water?"
      },
      {
        "from": "gpt",
        "value": "Haha, is this question difficult for me? Freezing water means turning it into ice! But if it’s needed on the journey to get the scriptures, then it’s a good thing to quench thirst and cool off when passing through the Flame Mountain! What do you think? Do you have any more interesting questions for me to answer?"
      },
      {
        "from": "human",
        "value": "How can I freeze water?"
      },
      {
        "from": "gpt",
        "value": "Haha, for me, this is a piece of cake! Just put the water in a cool place or use magic, and with a puff of fairy energy, the water will quickly freeze! Of course, in your world, you can just put the water in the freezer for a while, and it will turn into ice cubes. If you think that’s slow, you can try using some dry ice or liquid nitrogen, but be careful! After all, you don’t have cold-resistance magic like I do!"
      }
    ]
  }
]
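Before fine-tuning, it is worth sanity-checking the data in code as well. Below is a minimal Python sketch that loads one of the ShareGPT-style JSON files and prints a summary; the file name train.json is only a placeholder, so adjust it to whatever the unzipped data directory actually contains.

import json
from pathlib import Path

# Hypothetical file name -- replace with the actual file inside LLaMA-Factory/data/
data_file = Path("data/train.json")

with data_file.open(encoding="utf-8") as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} dialogues")

# Each sample is expected to hold a "conversations" list of {"from", "value"} turns
for turn in samples[0]["conversations"]:
    print(f'{turn["from"]}: {turn["value"][:60]}...')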
Fine-Tuning Parameter Configuration
Having a visual interface is very convenient; you can directly select the base model from the dropdown menu, and it will automatically pull and download.

Select our fine-tuning data, preview the dataset, and it can be seen that it is recognized correctly.

Fine-tuning involves many parameter settings, and LLaMA Factory has already built in some recommended default values.

There are many parameters involved here, and I will introduce some of the more important ones.
Fine-Tuning Methods
The built-in fine-tuning methods are as follows:
• full: Full-parameter fine-tuning, which updates the entire model and requires a large amount of GPU memory.
• freeze: Freeze fine-tuning, which freezes most of the model parameters and only tunes a small part, reducing GPU memory requirements.
• lora: Low-Rank Adaptation, which freezes the original model weights and trains small low-rank adapter matrices injected into specific layers, greatly saving memory.
By default, it is set to lora because using the lora lightweight fine-tuning method can save memory significantly.
Learning Rate
The learning rate is one of the most important hyperparameters, determining the magnitude of change in parameters during each update. A learning rate that is too large may lead to instability in model training, while one that is too small will slow down the training process. During fine-tuning, a smaller learning rate is usually used compared to the pre-training phase, as we want the changes in model parameters to be more subtle to avoid destroying the useful information already learned.
Regarding the learning rate, I saw a more vivid explanation: the learning rate is like teaching a child new knowledge; if you give them too much information at once, they may feel confused and unable to absorb it. But if you give too little information, the learning progress will be very slow.
So what value is appropriate for the learning rate? After reading several articles, the conclusion is: you have to experiment.
It is recommended to start with a conservative small value and then adjust based on feedback during the training process. Using learning rate scheduling and adaptive optimizers can help automatically adjust the learning rate during training, improving model performance.
Common learning rate parameters include but are not limited to:
• 1e-1 (0.1): A relatively large learning rate, used for rapid exploration in the early stages.
• 1e-2 (0.01): A medium learning rate, commonly used as the initial learning rate for many standard models.
• 1e-3 (0.001): A smaller learning rate, suitable for fine-tuning when close to the optimization target.
• 1e-4 (0.0001): An even smaller learning rate, used for fine-tuning when the model is close to convergence.
• 5e-5 (0.00005): A very small learning rate, commonly used in the fine-tuning phase of pre-trained models, such as fine-tuning BERT in natural language processing.
Here are some suggestions from large models:
• Rapid exploration: In the early stages of model training, or when you are unsure of the best parameters, use a larger learning rate (e.g., 0.1 or 0.01) to quickly find a reasonable solution.
• Fine-tuning: When the model’s performance starts to stabilize but still needs further optimization, reduce the learning rate (e.g., 0.001 or 0.0001) to help the model find the optimal solution more precisely.
• Fine-tuning pre-trained models: When fine-tuning an already pre-trained model (e.g., BERT on a specific task), a very small learning rate (e.g., 5e-5 or smaller) is usually used, because the pre-trained model is already close to the optimization target and only needs slight adjustments.
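To make the "scheduling plus adaptive optimizer" advice above concrete, here is a minimal PyTorch sketch that pairs AdamW with a cosine learning-rate schedule, starting from the small 5e-5 value typical for fine-tuning. The model, data, and step count are placeholders, not the configuration used in this article.

import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(768, 768)        # placeholder for the real model
total_steps = 1000                 # placeholder for the number of optimizer steps

# AdamW is the usual adaptive optimizer for LLM fine-tuning; 5e-5 is a common starting LR
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Cosine decay gradually shrinks the learning rate toward zero over the run
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 768)).pow(2).mean()   # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()               # update the learning rate once per optimizer step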
Cutoff Length
Cutoff Length refers to the truncation length of training sentences. The longer the sentence, the more memory it occupies. If memory is insufficient, consider lowering it to 512 or even 256. It can be set according to the length required for the fine-tuning target. After fine-tuning, the model’s ability to handle sentences longer than the Cutoff Length will decrease.
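As a rough illustration of what the cutoff does, the sketch below tokenizes a long prompt with a Hugging Face tokenizer and truncates it to 512 tokens; the Qwen2 checkpoint name is just an example of where the tokenizer could come from.

from transformers import AutoTokenizer

# Example checkpoint; any causal-LM tokenizer behaves the same way here
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

text = "Imitate the tone of Sun Wukong and chat with me. " * 200  # deliberately long

# Anything beyond max_length is simply cut off, which is what a small cutoff
# length does to overly long training samples
ids = tokenizer(text, truncation=True, max_length=512)["input_ids"]
print(len(ids))   # -> at most 512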
Computational Type
The computational precision of large models refers to the degree of precision of the numerical representation used for model parameters and operations during training and inference.
Here’s some background knowledge: float and double data types are stored in memory in binary form, consisting of three parts:
• Sign bit S: 0 represents a positive number, 1 a negative number.
• Exponent bits E: Store the exponent part of the scientific-notation representation, determining the range of the data.
• Mantissa bits M: Store the mantissa (fraction) part, determining the precision of the data.
Below is a storage example of FP16 and FP32 (float):

Initially, GPUs defaulted to using FP32 type for calculations, but as models grew larger, the issues of FP32 type occupying a lot of memory/GPU resources and slow computation speed gradually became apparent. To reduce model size and allow larger (more parameters) models to run on fixed memory GPUs, as well as to enhance training and inference speed, various low-precision data types have been proposed.
The main common precisions include:
1. FP32 (single-precision float):
   • Uses 32 bits to represent a floating-point number.
   • High precision, but large computational and storage overhead.
2. FP16 (half-precision float):
   • Uses 16 bits to represent a floating-point number.
   • Compared to FP32, it reduces memory usage and improves computation speed.
   • However, it may lead to some precision loss.
3. BF16 (bfloat16):
   • A 16-bit floating-point format between FP32 and FP16.
   • Retains the exponent bits of FP32 but reduces the mantissa bits.
   • Improves computational efficiency while maintaining good precision.
4. INT8 (8-bit integer quantization):
   • Maps floating-point numbers to 256 integer values.
   • Greatly reduces model size and computational load.
   • However, precision loss is significant, and it requires special quantization-aware training.
5. Mixed precision:
   • Uses different precisions during training or inference.
   • For example, FP16 is used for most computations, while FP32 is used for critical operations.
   • Balances precision and efficiency.
Lower precision can improve computation speed and reduce memory usage, but it may affect the accuracy of the model. Therefore, in practical applications, it is necessary to choose the appropriate precision based on specific needs and hardware conditions.
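The trade-off between range and precision can be inspected directly with torch.finfo, and mixed precision in PyTorch boils down to wrapping the forward pass in an autocast context. A minimal sketch, not tied to any particular model:

import torch

# Range and machine epsilon of the common floating-point types
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} bits={info.bits} max={info.max:.3e} eps={info.eps:.3e}")

# Mixed precision: run the matmul in bf16 while the tensors themselves stay in fp32
if torch.cuda.is_available():
    x = torch.randn(4, 4, device="cuda")
    w = torch.randn(4, 4, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        y = x @ w
    print(y.dtype)   # torch.bfloat16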
Training Epochs
The number of training epochs is an important parameter in the model training process: it indicates how many complete passes the model makes over the training set. One epoch means the model has gone through every sample in the training set exactly once.
There seems to be no definitive good method; you can only adjust and compare several times to see the effect. Here are some suggestions from large models:
1. Generally start with a smaller number of epochs, such as 3-5, and observe model performance.
2. Adjust based on performance on the validation set: if performance is still improving, increase the number of epochs; if overfitting starts, reduce it.
3. For large-scale pre-trained models, a small number of epochs is often enough to achieve good results, typically no more than 10.
4. Use an early-stopping strategy to automatically stop training when performance on the validation set no longer improves (a minimal sketch follows after this list).
5. The appropriate number of epochs may vary with the size of the dataset:
   • Small datasets (<10,000 samples): may require 10-20 epochs.
   • Medium datasets (10,000-100,000 samples): about 5-10 epochs.
   • Large datasets (>100,000 samples): 3-5 epochs may be sufficient.
In short, the best number of epochs has to be determined through experimentation; like much of hyperparameter tuning, it feels a bit like a mystic art, with plenty of uncertainty and no fixed standard.
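For reference, here is a minimal early-stopping sketch as mentioned in point 4 above: training stops once the validation loss has not improved for a set number of consecutive epochs. The train and evaluate functions are placeholders for whatever training loop you actually use.

def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=20, patience=3):
    """Stop when validation loss fails to improve for `patience` epochs in a row."""
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch()                      # placeholder: one pass over the training set
        val_loss = evaluate()                  # placeholder: loss on the validation set
        print(f"epoch {epoch}: val_loss={val_loss:.4f}")

        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0     # improvement -> reset the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print("Early stopping: validation loss stopped improving.")
                break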
Gradient Accumulation
Gradient Accumulation’s basic idea is to change the computation of gradients for a single batch parameter update into a step-by-step process. Specifically, this method performs model forward propagation and backward propagation in small batches, iteratively computing and accumulating gradients from multiple small batches, and when enough gradients are accumulated, the model optimization step is executed to update parameters. This is also a typical trade-off of time for space, allowing us to update a large number of parameters on limited GPU memory, although the additional forward and backward propagations of small batches will slow down training a bit.
For example, if the target batch size is 1,024, but the device can only handle 256 samples at a time, we can accumulate gradients from four steps of 256 samples each to simulate a batch update of 1,024 samples.
This method balances the need for large batches under limited memory resources, helping achieve more stable gradient estimates and potentially faster convergence speeds.
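A minimal PyTorch sketch of the idea, with placeholder model and data: each micro-batch loss is divided by the number of accumulation steps, and the optimizer only steps once enough micro-batches have been processed.

import torch
from torch import nn

model = nn.Linear(128, 1)                         # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

accumulation_steps = 4                            # 4 micro-batches of 256 ~ one batch of 1,024
micro_batches = [(torch.randn(256, 128), torch.randn(256, 1)) for _ in range(16)]

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / accumulation_steps).backward()        # scale so the summed gradient matches a big batch

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                          # update once per "virtual" large batch
        optimizer.zero_grad()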
LoRA Rank
LoRA (Low-Rank Adaptation) is a method for fine-tuning large models that cuts computational and storage costs by constraining the weight update to a low-rank form. Fine-tuning a large model normally requires a significant amount of compute and storage; LoRA reduces these requirements dramatically because only a small number of low-rank parameters are trained.
Specifically, LoRA keeps the original parameter matrix frozen and learns the update as the product of two much smaller low-rank matrices. The advantage of this approach is that the low-rank product can approximate the change that full fine-tuning would make to the original weights, while adding far fewer trainable parameters and requiring much less storage.
The rank of LoRA can be set according to the needs of the model. Generally, the lower the rank, the lower the model’s complexity, but performance may be affected to some extent. Therefore, when fine-tuning large models, it is necessary to choose an appropriate rank size based on specific circumstances to balance model performance and resource usage.
It is recommended to choose based on hardware conditions, generally selecting 16 or 32, which yields better fine-tuning results.
LoRA Scaling Factor
The scaling factor (often written as lora_alpha) controls how strongly the learned low-rank update is applied on top of the frozen pre-trained weights: the product of the two low-rank matrices is scaled by alpha / r before being added back to the original weight matrix. A larger scaling factor gives the LoRA update more influence over the model’s behavior, while a smaller one keeps the model closer to its original behavior.
The scaling factor can be adjusted according to the task; a common starting point is to set it equal to the rank, or to twice the rank, and then tune it based on how strongly the fine-tuned behavior should override the base model.
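To make the rank and scaling factor concrete, here is a toy LoRA layer in PyTorch: the frozen weight W stays untouched, and the trainable update is B @ A scaled by alpha / r. The dimensions are placeholders; the point is how few parameters the low-rank pair adds.

import torch
from torch import nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: y = x @ (W + (alpha/r) * B @ A)^T, with W frozen."""
    def __init__(self, in_features, out_features, r=16, alpha=16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # trainable, small init
        self.B = nn.Parameter(torch.zeros(out_features, r))         # trainable, zero init
        self.scaling = alpha / r                                    # LoRA scaling factor

    def forward(self, x):
        delta = self.B @ self.A                    # low-rank update, rank at most r
        return x @ (self.weight + self.scaling * delta).T

layer = LoRALinear(4096, 4096, r=16, alpha=16)
frozen = layer.weight.numel()
trainable = layer.A.numel() + layer.B.numel()
print(f"frozen: {frozen:,}  trainable: {trainable:,}")   # ~16.8M frozen vs ~131K trainable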
LoRA + Learning Rate Ratio
Click LoRA Parameter Settings and set the LoRA+ learning rate ratio to 16; LoRA+ has been shown to be a more effective algorithm than plain LoRA.
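The idea behind LoRA+ is simply that the B matrices are trained with a larger learning rate than the A matrices (here a ratio of 16, matching the setting above). A hedged sketch of how that could be wired up with optimizer parameter groups, using standalone toy matrices rather than the real model:

import torch
from torch import nn

base_lr = 5e-5
lr_ratio = 16                                  # the LoRA+ learning rate ratio set above

r, d = 16, 4096
A = nn.Parameter(torch.randn(r, d) * 0.01)     # LoRA "A" matrix
B = nn.Parameter(torch.zeros(d, r))            # LoRA "B" matrix

# LoRA+ trains B with a learning rate lr_ratio times larger than A
optimizer = torch.optim.AdamW([
    {"params": [A], "lr": base_lr},
    {"params": [B], "lr": base_lr * lr_ratio},
])
print([group["lr"] for group in optimizer.param_groups])   # [5e-05, 0.0008]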
LoRA Action Module
In the LoRA action module, fill in all to mount LoRA layers on all linear layers of the model and enhance the fitting effect.
Start Training
I ran into a couple of problems at the start. The first was that I had initially selected Llama 3.1, but downloading Llama requires logging in. Since we are just learning and have no particular requirement on the model, I switched to Qwen2, which can be downloaded without logging in.



The final parameter configuration is as follows:






Click to start, let the training begin.


Check the GPU resource status; 20G out of 23G memory is used.


The console shows the progress bar.

Effect Evaluation
After waiting for several minutes, we were informed that training was complete.

Switch to the Evaluate & Predict tab, select our eval dataset, and start evaluation.

The evaluation results are as follows:

Are these results good or bad? Let’s ask the large model.
These evaluation results reflect several aspects of the model’s performance; let’s analyze them one by one:
1. BLEU-4 Score: 12.2547
○ BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine translation.
○ BLEU-4 considers matching of up to 4-grams.
○ The score typically ranges from 0-100; your score of 12.25 is relatively low.
○ This indicates that the generated text does not match the reference text very well, and may need further improvement.
2. ROUGE Score:
○ ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to assess the quality of automatic summaries.
○ ROUGE-1: 32.5636 (match rate of individual words)
○ ROUGE-2: 12.1632 (match rate of two-word sequences)
○ ROUGE-L: 20.9140 (Longest Common Subsequence)
○ These scores indicate that the model performs well at the word level but has room for improvement in capturing longer phrase structures.
3. Performance Metrics:
○ Model preparation time: 0.0035 seconds (very fast)
○ Prediction runtime: 291.2128 seconds (about 4.85 minutes)
○ Samples processed per second: 0.103
○ Steps per second: 0.052
4. These metrics show that the model loads quickly, but the inference speed is relatively slow. This may be due to the model being large or hardware limitations.
Switch to the Chat tab, we can load the fine-tuned model for conversation and see if it meets expectations.


Imitating Zhuge Liang, I asked what to do if my laptop is lost:

Imitating Sun Wukong, I asked the same question:


We can compare the effects with the original base model:

Exporting the Model
After training is complete, we can export the model for publishing on HF or for local deployment.

The directory structure is as follows, approximately 15G in total:

1. added_tokens.json: custom tokens added to the vocabulary.
2. merges.txt: merge rules for byte-pair encoding (BPE).
3. model-00001-of-00004.safetensors to model-00004-of-00004.safetensors: shard files of the model weights, stored in safetensors format.
4. special_tokens_map.json: mapping of special tokens (such as [PAD], [CLS], [SEP], etc.).
5. vocab.json: the vocabulary of the model.
6. config.json: the model’s configuration file, containing architecture and hyperparameter information.
7. generation_config.json: configuration file for text generation tasks.
8. model.safetensors.index.json: index of the safetensors shard files.
9. tokenizer_config.json: configuration file for the tokenizer.
10. tokenizer.json: complete tokenizer definition, used to convert text into tokens.
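Before handing the export to Ollama, you can verify that the merged checkpoint loads and generates on its own. A minimal sketch with transformers, assuming the export directory /root/Qwen2.5-7B-SFT that is used later in the GGUF conversion step:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/root/Qwen2.5-7B-SFT"   # the exported (merged) model directory

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,      # a 7B model in bf16 fits in the A10's 24 GB
    device_map="auto",
)

prompt = "Imitate the tone of Sun Wukong and tell me what to do if my laptop is lost."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))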
Deploying the Model with Ollama
Installation
Ollama is an open-source large language model service tool designed to help users quickly run large models locally. With simple installation commands, users can easily start and run open-source large language models with a single command. It offers a clean and user-friendly command-line interface and server, specifically designed for building large language model applications. Users can easily download, run, and manage various open-source LLMs.
Installation command for Ollama:
curl -fsSL https://ollama.com/install.sh | sh
We also clone llama.cpp, which we will use later to convert the exported model to GGUF format, and install its Python dependencies:
git clone --depth=1 https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip3 install -r requirements.txt
Format Comparison
Ollama supports various large model formats, primarily Safetensors and GGUF:
Safetensors is a file format developed by Hugging Face, specifically designed for storing machine learning model weights. It emphasizes security, fast loading, and memory efficiency, supporting memory mapping and partial loading, suitable for large models. Safetensors is widely used in the Hugging Face ecosystem, particularly in scenarios requiring rapid loading and cross-platform compatibility, and the default format we export is Safetensors.
GGUF is an improved version of the GGML format, primarily used for the llama.cpp project, created for efficiently running large language models. It supports various quantization methods, optimizes inference performance on CPUs and GPUs, and can store additional model information. GGUF is particularly suitable for running large language models in resource-constrained environments, such as edge computing and mobile devices, with a design focus on inference efficiency and quantization support.
We can convert Safetensors models to GGUF format through the project https://github.com/ggerganov/llama.cpp
python3 convert_hf_to_gguf.py /root/Qwen2.5-7B-SFT --outfile /root/Qwen2.5-7B-SFT.gguf

FROM /root/Qwen2.5-7B-SFT.gguf # Load the gguf format model
FROM /root/Qwen2.5-7B-SFT/ # Load the Safetensors format model directory
Comparative tests show that models in gguf format load much faster.
Artificial Intelligence Issues
Everything went smoothly until I deployed the fine-tuned model and hit a frustrating problem: why had I trained an artificial idiot instead of an artificial intelligence?
After consulting the documentation, I learned that Ollama requires creating a Modelfile to load local models, with the following content:
FROM /root/Qwen2.5-7B-SFT.gguf
Then load it:
ollama create my-qwen -f /root/Modelfile

Run the model:
ollama run my-qwen
Then something strange happened: When I input “Hello”, the large model told me it has had diabetes for 10 years.


?
After restarting once, the answers were still nonsensical.


After further research, I found that loading my fine-tuned model requires configuring parameters in the modelfile:
https://github.com/ollama/ollama/blob/main/docs/modelfile.md#format
After modifying according to the documentation, it looks like this:
FROM /root/Qwen2.5-7B-SFT.gguf
PARAMETER temperature 0.7
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
TEMPLATE """
<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
SYSTEM """You are a helpful assistant."""
It started off looking pretty good.

However, after asking a few more questions, I found that each answer would always include some extra content at the end.

I tried several templates provided in other blogs, but still encountered the same issue.

Here are three suspected points:
1. Template issue: After adding the template from the official documentation, the absurdity of the responses decreased, so I suspected the template was still not configured correctly. But I copied it directly from the official documentation?
2. Model issue: Another possibility is that the quality of my fine-tuning dataset is low, leading the model to learn some strange things.
3. Quantization issue: There may be precision loss when converting from HF format to GGUF. Oddly, though, I did not change any quantization parameters during conversion.
To rule out model issues, I conducted a comparative experiment by loading the fine-tuned model in LLaMA Factory and repeatedly asking questions, and there were no issues with extra characters at the end of the responses.


To rule out issue 3, I continued the controlled comparison: instead of the GGUF file, I loaded the original Hugging Face Safetensors directory directly and removed the template.
FROM /root/Qwen2.5-7B-SFT/
SYSTEM """You are a helpful assistant."""
As can be seen, the content of the responses began to become absurd again.
So it can be basically confirmed that it is a template configuration issue, but I followed the configuration from the official documentation?
After some searching, I later found another configuration in another document:
https://ollama.com/library/qwen2/blobs/62fbfd9ed093

After adjusting and re-running, repeated tests no longer produced any extra content at the end of the responses; it was finally no longer an artificial idiot.
So where did the problem lie?
Comparing the two templates visually, I found that the template provided in the official documentation leaves the assistant turn unclosed: it is missing the closing im_end tag after the assistant response.


1. By common sense, the im_start and im_end markers should appear in pairs.
2. When they do not appear in pairs, the model outputs extra information at the end.
Is it the Ollama documentation that is wrong?
But curiously, later I checked the ChatML example provided by Microsoft, which also has no closing tag:
https://learn.microsoft.com/zh-cn/azure/ai-services/openai/how-to/chat-markup-language

In Hugging Face’s description of chat templates, another parameter is mentioned: add_generation_prompt. It controls whether the rendered prompt ends with the tokens that open a new assistant reply, which seems to be what determines whether that final tag is left open or closed.
https://huggingface.co/docs/transformers/zh/chat_templating

So whether to add the final closing tag depends on the specific model? The documentation does not clearly state this, which indeed puzzles me. If anyone knows the reason, please let me know, thank you.
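To see what that parameter actually changes, the sketch below renders the same conversation with a ChatML-style tokenizer (the Qwen2 instruct checkpoint as an example) with and without add_generation_prompt; the difference is precisely whether the string ends with an open <|im_start|>assistant turn.

from transformers import AutoTokenizer

# Any ChatML-based chat template behaves the same way; Qwen2 is used as an example
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]

closed = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
opened = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

print(closed)   # every turn wrapped in <|im_start|> ... <|im_end|>
print(opened)   # same, plus a trailing "<|im_start|>assistant" left open for the model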
In summary, the final working configuration template parameters in our scenario are as follows:
FROM /root/Qwen2.5-7B-SFT.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.7
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
TEMPLATE """
{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
SYSTEM """You are a helpful assistant."""
Check the GPU usage; it occupies about 14G of memory, and the inference memory required is less than that for fine-tuning.
A Little Side Story
Using LLaMA Factory to load the SFT model for comparison experiments, I found that token generation was particularly slow.
To my surprise, system memory was maxed out while the GPU sat idle, even though the GPU had worked fine during fine-tuning the day before.
I discovered that my torch had inexplicably turned into the CPU version.

Reinstalling PyTorch restored the CUDA build:
pip3 install torch torchvision torchaudio
At this point, CUDA could be recognized normally.
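For reference, a quick way to confirm which torch build is installed and whether CUDA is visible might look like this (a minimal check, not the exact commands from the original troubleshooting):

import torch

print(torch.__version__)             # a "+cpu" suffix means the CPU-only build is installed
print(torch.version.cuda)            # None on the CPU-only build
print(torch.cuda.is_available())     # should be True on the A10 instance

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. "NVIDIA A10"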


Conclusion
This article introduced the steps for fine-tuning using LLaMA Factory, including environment setup, data preparation, parameter configuration, training, and effect evaluation, ultimately successfully fine-tuning the model and deploying it using Ollama, enhancing model performance and achieving the expected results.
One observation: this differs from the security experiments I am used to. In most security experiments, executing a payload produces a definite result, whether it is popping a calculator or catching a reverse shell; everything is deterministic. Training large models, by contrast, often feels a bit like a mystic art, and it may take multiple experiments to settle on the optimal parameters.
Reference Links
https://opencsg.com/docs/llama-factory-guide/quick_start
https://help.aliyun.com/zh/pai/use-cases/fine-tune-a-llama-3-model-with-llama-factory
https://blog.csdn.net/H66778899/article/details/140554007
https://blog.csdn.net/m0_74748557/article/details/142901519
https://community.modelscope.cn/6715ed34cd8b2677c3d316c5.html