Fine-Tuning Llama 3 with Hugging Face for $250

Reporting by Machine Heart
Editor: Zhao Yang
Fine-tuning large language models has always been easier said than done. Recently, Hugging Face’s technical director, Philipp Schmid, published a blog post detailing how to fine-tune large models with the libraries available on Hugging Face together with FSDP and Q-Lora.


We know that open-source large language models like Llama 3 from Meta, Mistral from Mistral AI, and Jamba from AI21 Labs have become competitors to OpenAI.
However, in most cases, users need to fine-tune these open-source models based on their own data to fully unleash the model’s potential.
While fine-tuning smaller large language models like Mistral with Q-Lora on a single GPU is not difficult, efficiently fine-tuning large models like Llama 3 70B or Mixtral 8x7B has remained a challenge until now.
Therefore, Hugging Face’s technical director, Philipp Schmid, explains how to fine-tune Llama 3 using PyTorch FSDP and Q-Lora, with the help of Hugging Face’s TRL, Transformers, PEFT, and Datasets libraries. In addition to FSDP, Flash Attention v2 is used through PyTorch’s SDPA implementation, available since the PyTorch 2.2 update.
The main steps for fine-tuning are as follows:
  • Set up the development environment
  • Create and load the dataset
  • Fine-tune the large language model using PyTorch FSDP, Q-Lora, and SDPA
  • Test the model and perform inference
Note: The experiments conducted in this article were created and validated on NVIDIA H100 and NVIDIA A10G GPUs. The configuration files and code are optimized for 4xA10G GPUs, each equipped with 24GB of memory. If users have more computing power, the configuration files mentioned in step 3 (yaml files) need to be modified accordingly.
Background Knowledge on FSDP+Q-Lora
Drawing on a collaborative project between Answer.AI, Q-Lora creator Tim Dettmers, and Hugging Face, the author summarizes what the combination of Q-Lora and PyTorch FSDP (Fully Sharded Data Parallel) makes possible.
The combined use of FSDP and Q-Lora allows users to fine-tune Llama 2 70b or Mixtral 8x7B on just two consumer-grade GPUs (24GB). Details can be found in the article below. Hugging Face’s PEFT library plays a crucial role in this.
Article link: https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
PyTorch FSDP is a data/model parallelism technology that can split models across GPUs, reducing memory requirements and enabling more efficient training of larger models. Q-LoRA is a fine-tuning method that effectively reduces computational demands and memory usage by utilizing quantization and low-rank adapters.
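To illustrate the Q-LoRA side of this in code, here is a minimal sketch, not the author’s script, of loading a base model in 4-bit and attaching low-rank adapters with PEFT; the model id and hyperparameters below are purely illustrative:
# Minimal Q-LoRA sketch: quantize the base model to 4-bit and train only small
# low-rank adapters, so just a tiny fraction of parameters needs gradients.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # smaller model, just for illustration
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=8, lora_dropout=0.05,
    target_modules="all-linear", task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows that only a small percentage of parameters is trainable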
Setting Up the Development Environment
The first step is to install Hugging Face Libraries and PyTorch, including libraries like TRL, Transformers, and Datasets. TRL is a new library built on top of Transformers and Datasets that makes fine-tuning open-source large language models, RLHF, and alignment easier.
# Install Pytorch for FSDP and FA/SDPA
%pip install "torch==2.2.2" tensorboard

# Install Hugging Face libraries
%pip install  --upgrade "transformers==4.40.0" "datasets==2.18.0" "accelerate==0.29.3" "evaluate==0.4.1" "bitsandbytes==0.43.1" "huggingface_hub==0.22.2" "trl==0.8.6" "peft==0.10.0"
Next, log in to Hugging Face to obtain the Llama 3 70b model.
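For example, a minimal way to log in from a notebook with the huggingface_hub library (the token string below is a placeholder for your own access token):
# Log in to the Hugging Face Hub so the gated Llama 3 weights can be downloaded
from huggingface_hub import login

login(token="hf_...", add_to_git_credential=True)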
Creating and Loading the Dataset
Once the environment is set up, we can start creating and preparing the dataset. The dataset used for fine-tuning should contain example samples of the tasks that users want to solve. Reading “How to Fine-Tune LLMs in 2024 with Hugging Face” can provide further insights on creating datasets.
Article link: https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#3-create-and-prepare-the-dataset
The author used the HuggingFaceH4/no_robots dataset, a high-quality dataset of 10,000 instructions and samples annotated by skilled humans. This data can be used for supervised fine-tuning (SFT) to make the language model better follow human instructions. The no_robots dataset is modeled after the instruction dataset described in OpenAI’s InstructGPT paper and consists mostly of single-turn instructions.
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]} 
The 10,000 samples in the no_robots dataset are divided into 9,500 training samples and 500 testing samples, with some samples lacking system information. The author used the datasets library to load the dataset, added the missing system information, and saved them to separate JSON files. The sample code is as follows:
from datasets import load_dataset

# Convert dataset to OAI messages
system_message = """You are Llama, an AI assistant created by Philipp to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

def create_conversation(sample):
    if sample["messages"][0]["role"] == "system":
        return sample
    else:
        sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
        return sample

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/no_robots")


# Add system message to each conversation
columns_to_remove = list(dataset["train"].features)
columns_to_remove.remove("messages")
dataset = dataset.map(create_conversation, remove_columns=columns_to_remove,batched=False)


# Filter out conversations that are corrupted with wrong turns; keep only those with an even number of turns after adding the system message
dataset["train"] = dataset["train"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
dataset["test"] = dataset["test"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)


# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records", force_ascii=False)
dataset["test"].to_json("test_dataset.json", orient="records", force_ascii=False)
Fine-Tuning LLM Using PyTorch FSDP, Q-Lora, and SDPA
Next, we will fine-tune the large language model using PyTorch FSDP, Q-Lora, and SDPA. Since training runs distributed across multiple GPUs, the author uses torchrun and a Python script to launch it.
The author wrote a script called run_fsdp_qlora.py, which loads the dataset from disk, initializes the model and tokenizer, and starts model training. The script uses the SFTTrainer from the TRL library to fine-tune the model.
SFTTrainer makes supervised fine-tuning of open-source large language models easier to get started with, specifically by supporting:
  • Dataset formatting, including conversational and instruction formats (used)
  • Training on completions only, ignoring the prompt tokens (not used)
  • Packing datasets for more efficient training (used)
  • Parameter-efficient fine-tuning (PEFT) techniques, including Q-LoRA (used)
  • Preparing the model and tokenizer for conversational fine-tuning (not used, see below)
Note: The author uses a chat template similar to Anthropic/Vicuna, with “User” and “Assistant” roles. This is because the special tokens in the base Llama 3 model (e.g., <|begin_of_text|> and <|reserved_special_token_XX|>) have not been trained.
This means that if these tokens are to be used in the template, they need to be trained and the embedding layer and lm_head updated, which incurs additional memory requirements. If users have more computing power, they can modify the LLAMA_3_CHAT_TEMPLATE environment variable in the run_fsdp_qlora.py script.
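To make this concrete, below is a minimal, hypothetical sketch of the core training logic such a script could contain. It is not the author’s run_fsdp_qlora.py: the chat template, quantization and LoRA hyperparameters, and training arguments are illustrative, and the FSDP sharding itself is activated by the torchrun/accelerate launch and YAML configuration described below rather than by this code.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Meta-Llama-3-70b"

# Tokenizer with a simple Anthropic/Vicuna-style chat template (see the note above)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.chat_template = (
    "{% for m in messages %}"
    "{% if m['role'] == 'system' %}{{ m['content'] }}"
    "{% elif m['role'] == 'user' %}{{ '\n\nUser: ' + m['content'] + eos_token }}"
    "{% else %}{{ '\n\nAssistant: ' + m['content'] + eos_token }}{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '\n\nAssistant: ' }}{% endif %}"
)

# Render each conversation to plain text so samples can be packed
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
train_dataset = train_dataset.map(
    lambda s: {"text": tokenizer.apply_chat_template(s["messages"], tokenize=False)},
    remove_columns=["messages"],
)

# 4-bit Q-LoRA quantization; storing the quantized weights in bf16 lets FSDP shard them
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    attn_implementation="sdpa",   # Flash Attention v2 through PyTorch SDPA
    torch_dtype=torch.bfloat16,
    use_cache=False,              # incompatible with gradient checkpointing
)

# Low-rank adapter configuration (illustrative rank/alpha)
peft_config = LoraConfig(
    r=16, lora_alpha=8, lora_dropout=0.05, bias="none",
    target_modules="all-linear", task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(output_dir="./llama-3-70b-hf-no-robot",
                           per_device_train_batch_size=1, bf16=True,
                           gradient_checkpointing=True, num_train_epochs=3),
    train_dataset=train_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=3072,
    packing=True,   # pack multiple samples into one sequence
)
trainer.train()
trainer.save_model()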
In terms of configuration parameters, the author uses the new TrlParser, which allows hyperparameters to be provided in a YAML file or passed explicitly on the CLI to override values in the configuration file, such as --num_epochs 10. Below is the configuration file for fine-tuning Llama 3 70B on 4x A10G GPUs or 4x 24GB GPUs.
%%writefile llama_3_70b_fsdp_qlora.yaml
# script parameters
model_id: "meta-llama/Meta-Llama-3-70b" # Hugging Face model id
dataset_path: "."                      # path to dataset
max_seq_len:  3072 # 2048              # max sequence length for model and packing of the dataset
# training parameters
output_dir: "./llama-3-70b-hf-no-robot" # Temporary output directory for model checkpoints
report_to: "tensorboard"               # report metrics to tensorboard
learning_rate: 0.0002                  # learning rate 2e-4
lr_scheduler_type: "constant"          # learning rate scheduler
num_train_epochs: 3                    # number of training epochs
per_device_train_batch_size: 1         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: epoch                   # save checkpoint every epoch
evaluation_strategy: epoch             # evaluate every epoch
max_grad_norm: 0.3                     # max gradient norm
warmup_ratio: 0.03                     # warmup ratio
bf16: true                             # use bfloat16 precision
tf32: true                             # use tf32 precision
gradient_checkpointing: true           # use gradient checkpointing to save memory
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap offload" # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
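The configuration above is consumed by the training script via TrlParser. The sketch below shows how such parsing might look; the ScriptArguments dataclass and its fields are assumptions mirroring the script-parameter keys in the YAML, and the import path is the one used by trl 0.8.x.
# Hypothetical sketch of how a script could read the YAML config and CLI overrides
from dataclasses import dataclass, field
from transformers import TrainingArguments
from trl.commands.cli_utils import TrlParser

@dataclass
class ScriptArguments:
    model_id: str = field(default=None, metadata={"help": "Hugging Face model id"})
    dataset_path: str = field(default=".", metadata={"help": "path to dataset"})
    max_seq_len: int = field(default=3072, metadata={"help": "max sequence length"})

if __name__ == "__main__":
    # Reads --config llama_3_70b_fsdp_qlora.yaml and merges any extra CLI flags
    # (e.g. --num_train_epochs 10) on top of the YAML values.
    parser = TrlParser((ScriptArguments, TrainingArguments))
    script_args, training_args = parser.parse_args_and_config()
    print(script_args.model_id, training_args.num_train_epochs)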
Note: At the end of training, GPU memory usage will slightly increase (about 10%) due to the overhead of saving the model. Therefore, ensure that there is enough memory on the GPU to save the model.
For training, the author uses torchrun to launch the script, which keeps the launch flexible and makes it easy to adapt to environments such as Amazon SageMaker or Google Cloud Vertex AI.
For torchrun and FSDP, the author needs to set the environment variables ACCELERATE_USE_FSDP and FSDP_CPU_RAM_EFFICIENT_LOADING to inform transformers/accelerate to use FSDP and load the model in a memory-efficient manner.
Note: If you want to disable the CPU offloading feature, you need to change the settings of FSDP. This operation is only applicable to GPUs with more than 40GB of memory.
This article uses the following command to start training:
!ACCELERATE_USE_FSDP=1 FSDP_CPU_RAM_EFFICIENT_LOADING=1 torchrun --nproc_per_node=4 ./scripts/run_fsdp_qlora.py --config llama_3_70b_fsdp_qlora.yaml
Expected Memory Usage:
  • Full fine-tuning with FSDP requires about 16 GPUs with 80GB memory
  • FSDP+LoRA requires about 8 GPUs with 80GB memory
  • FSDP+Q-Lora requires about 2 GPUs with 40GB memory
  • FSDP+Q-Lora+CPU offloading requires 4 GPUs with 24GB memory (about 22GB used per GPU) plus 127GB of CPU RAM, with a sequence length of 3072 and a batch size of 1
On a g5.12xlarge server, based on a dataset containing 10,000 samples, the author trained Llama 3 70B for 3 epochs using Flash Attention, which took a total of 45 hours. The cost per hour is $5.67, resulting in a total cost of $255.15. This may sound expensive, but it allows you to fine-tune Llama 3 70B on smaller GPU resources.
If we scale the training to 4x H100 GPUs, the training time drops to about 1.25 hours. Assuming one H100 costs $5-10 per hour, the total cost will be between $25 and $50.
We need to make trade-offs between accessibility and performance. If more and better computing resources are available, training time and cost can be reduced; however, even with limited resources, fine-tuning Llama 3 70B is possible. With 4x A10G GPUs, parts of the model have to be offloaded to the CPU, which lowers the overall FLOPS, so cost and performance differ accordingly.
Note: During evaluation and testing, the author noticed that about 40 max steps (80 samples packed into sequences of length 3072) were sufficient to obtain preliminary results. The training for 40 steps takes about 1 hour and costs roughly $5.
Optional Step: Merging LoRA’s Adapters into the Original Model
When using QLoRA, the author only trains the adapters without modifying the entire model. This means that when saving the model during training, only the adapter weights are saved, not the full model.
If users want to save the complete model to make it easier to use with text generation inference services, they can use the merge_and_unload method to merge the adapter weights into the model weights and then use the save_pretrained method to save the model. This will save a standard consolidated model that can be used for inference.
Note: CPU memory needs to be greater than 192GB.
#### COMMENT IN TO MERGE PEFT AND BASE MODEL ####
# import torch
# from peft import AutoPeftModelForCausalLM


# # Load PEFT model on CPU
# model = AutoPeftModelForCausalLM.from_pretrained(
#     args.output_dir,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
# )
# # Merge LoRA and base model and save
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained(args.output_dir,safe_serialization=True, max_shard_size="2GB")
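Once merged and saved, the checkpoint in the output directory can be loaded like an ordinary Transformers model. A minimal sketch, assuming the merge step above was run and the tokenizer was also saved to the same directory:
# Load the merged checkpoint without any PEFT wrapper
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_model = AutoModelForCausalLM.from_pretrained(
    "./llama-3-70b-hf-no-robot", torch_dtype=torch.float16, device_map="auto"
)
merged_tokenizer = AutoTokenizer.from_pretrained("./llama-3-70b-hf-no-robot")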
Model Testing and Inference
After training is complete, we need to evaluate and test the model. The author loads different samples from the original dataset and manually evaluates the model. Evaluating generative AI models is not easy, as one input can have multiple correct outputs. Reading “Evaluating LLMs and RAG, a Practical Case Using Langchain and Hugging Face” can provide insights on evaluating generative models.
Article link: https://www.philschmid.de/evaluate-llm
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

peft_model_id = "./llama-3-70b-hf-no-robot"


# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
  peft_model_id,
  torch_dtype=torch.float16,
  quantization_config= {"load_in_4bit": True},
  device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)
Next, load the test dataset and generate a response for a random sample.
from datasets import load_dataset
from random import randint


# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)
messages = eval_dataset[rand_idx]["messages"][:2]


# Test on sample
input_ids = tokenizer.apply_chat_template(messages,add_generation_prompt=True,return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id= tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]


print(f"**Query:**\n{eval_dataset[rand_idx]['messages'][1]['content']}\n")
print(f"**Original Answer:**\n{eval_dataset[rand_idx]['messages'][2]['content']}\n")
print(f"**Generated Answer:**\n{tokenizer.decode(response,skip_special_tokens=True)}\n")


# **Query:**
# How long was the Revolutionary War?
# **Original Answer:**
# The American Revolutionary War lasted just over seven years. The war started on April 19, 1775, and ended on September 3, 1783.
# **Generated Answer:**
# The Revolutionary War, also known as the American Revolution, was an 18th-century war fought between the Kingdom of Great Britain and the Thirteen Colonies. The war lasted from 1775 to 1783.
That covers the main workflow. Don't hesitate: start from the first step and try it yourself!
Original link: https://www.philschmid.de/fsdp-qlora-llama3?continueFlag=7e3b3f9059405e4318552e99bd128514