In-Depth Analysis of RL Strategies in Mainstream Open-Source LLMs

The author is an internet practitioner at Meta, focusing on LLM4Code and LLM infra.

The original text is from Zhihu, link:

https://zhuanlan.zhihu.com/p/16270225772


RLHF is an important part of LLM training. With the development of open-source models, we observe that some mainstream open-source large models such as DeepSeek, Qwen, LLaMA, etc., adopt different strategies and implementation methods to solve RL problems. These models have their own strengths in the design of the learning process and strategy selection. This article will discuss and summarize the RL strategies adopted by several mainstream open-source models.
DeepSeek series: Early models used DPO for alignment, gradually transitioning to PPO, and recently using GRPO for RLHF phase learning. The RM strategies are also evolving, balancing rule-based RM and model-based RM, while the latest DeepSeek-V3 also employs a self-rewarding strategy, allowing the model to continuously improve itself.
Qwen series: Transitioned from early PPO to DPO (while still training an RM for sample selection), using DPO in the offline phase and GRPO in the online phase. The latest Qwen2.5-Coder model uses only offline DPO.
LLaMA: Tends to adopt iterative techniques to optimize the model, combining Rejection Sampling + PPO (or DPO) for model optimization in each round.
There are a few conclusions:
1. The contest between GRPO/PPO and DPO seems to have no clear winner. LLaMA leans towards DPO, DeepSeek prefers GRPO, while Qwen combines both.
2. Regardless of whether GRPO/PPO or DPO is used, RM is particularly crucial (even when using DPO for RL, RM is needed for Rejection Sampling). Various models mention some RM optimization points and key aspects in almost every update.
3. The necessity of the RL phase has reached a consensus; simple SFT is far from sufficient. Especially for strong reasoning scenarios like coding/mathematics, RL plays a key role in enhancing model capabilities.

01

DeepSeek Series

DeepSeek LLM (2024-01)

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
In the RL phase, only DPO was used, constructing preference data for DPO training, including usefulness and harmlessness data. The candidates for preference data were directly generated by DeepSeek Chat. It was found that DPO can enhance the model’s open-ended generation skills, but there is little difference in performance on standard benchmark tests.
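For reference, below is a minimal sketch of the standard DPO loss over (chosen, rejected) preference pairs; the tensor names and the beta value are illustrative, not details from the DeepSeek report.

```python
# Minimal sketch of the standard DPO loss over preference pairs.
# Inputs are summed log-probabilities of the chosen/rejected responses
# under the policy being trained and under the frozen reference (SFT) model.
# Names and beta are illustrative, not taken from the DeepSeek paper.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of policy vs. reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # DPO pushes the margin between chosen and rejected log-ratios apart
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```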

DeepSeek-Coder (2024-01)

DeepSeek-Coder: When the Large Language Model Meets Programming — The Rise of Code Intelligence
RL was not used; only SFT was employed for alignment.

DeepSeek-V2 (2024-05)

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 uses GRPO. Compared with PPO, GRPO omits the critic model and instead optimizes the policy model by estimating the baseline from a group of sampled outputs.
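As a rough illustration of the group-baseline idea (not DeepSeek's actual training code), the sketch below computes GRPO-style advantages by normalizing each sampled output's reward against the mean and standard deviation of its own group, so no learned critic is needed.

```python
# Sketch of GRPO's group-relative advantage: rewards for a group of outputs
# sampled from the same prompt are normalized within the group, replacing
# the critic/value model used by PPO. Shapes and epsilon are illustrative.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: [num_prompts, group_size], one reward per sampled output
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Each output's advantage is its reward relative to its own group
    return (rewards - mean) / (std + eps)
```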
It adopts a two-stage training strategy:
First stage: Aimed at enhancing reasoning ability. A reward model focusing on code and mathematical reasoning ability was trained for alignment.
Second stage: Aimed at enhancing human alignment ability. Three reward models (safety, helpfulness, and rule-based) were combined with weights for alignment.
Additionally, many optimizations were made in engineering strategies to enhance training efficiency.
Some RL-related observations and discussions:
  1. DeepSeek-V2 Chat (RL) performed excellently on mathematical and coding tasks; on benchmarks such as GSM8K, MATH, and HumanEval, the RL-phase training significantly improved the model's performance.
  2. Alignment tax: the model cannot completely avoid performance trade-offs during alignment. In particular, after the reinforcement learning (RL) phase the model may perform worse on certain standard benchmarks (such as BBH), even though it performs better on open-ended generation tasks (such as dialogue generation).
  3. Online RL: online reinforcement learning significantly outperformed offline RL in preference-alignment experiments, so the team invested considerable effort in implementing an online RL framework to better align DeepSeek-V2's preferences.

DeepSeek-Coder-V2 (2024-06)

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
The overall training strategy is consistent with DeepSeek-V2.
The difference is:
For coding tasks, compiler feedback was not used directly; instead, a Reward Model was trained. Experiments showed that the reward model helped optimize and stabilize the training signal, especially for complex code-generation tasks, where it provided more reliable feedback to help the model learn and optimize better.

DeepSeek-V3 (2024-12)

DeepSeek-V3 Technical Report
The RL part still follows the GRPO method of the V2 series, and the Reward Model includes both rule-based and model-based types. The model-based RM is trained from the DeepSeek-V3 SFT checkpoint, inheriting the capabilities of the SFT model. To improve the reward model's reliability, the team constructed preference data containing chains of thought (CoT), which cover not only the final reward but also the model's reasoning process. The paper does not provide many specific details, and it is unclear whether a process reward model (PRM) was used.
Other related information:
  1. During the RL process, a Self-Rewarding strategy was adopted, especially in areas that are difficult to verify through external tools (such as creative writing). By using DeepSeek-V3’s own voting results as feedback sources, the model can self-improve.
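A very rough sketch of what such a self-rewarding signal could look like: the model judges pairs of its own candidate responses, and each candidate's win rate serves as its reward. The `model.judge` interface and the aggregation below are assumptions for illustration; the report does not specify the exact mechanism.

```python
# Rough sketch of a self-rewarding signal: the model judges pairs of its own
# candidate responses and each candidate's win rate serves as its reward.
# `model.judge` is a hypothetical interface returning the index (0 or 1) of
# the preferred response; the DeepSeek-V3 report does not give these details.
from itertools import combinations

def self_voting_rewards(prompt, candidates, model):
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        preferred = model.judge(prompt, candidates[i], candidates[j])  # 0 or 1
        wins[i if preferred == 0 else j] += 1
    total_matches = max(len(candidates) - 1, 1)
    return [w / total_matches for w in wins]  # win rate per candidate
```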

02

Qwen Series

Qwen (2023-09)

Qwen Technical Report
The RL phase used standard PPO. The RM training is divided into two stages: first, the model undergoes preference model pretraining (PMP) on a large amount of comparison data (pairs of responses together with their preference labels); then, the model is fine-tuned on human feedback to ensure that the reward model accurately reflects human preferences.
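For context, the core of standard PPO is the clipped surrogate objective; the sketch below shows the textbook form (policy term only, without the KL penalty that RLHF pipelines usually add) and is not taken from Qwen's training code.

```python
# Textbook PPO clipped surrogate objective (policy term only).
# logprobs: log-probs of the sampled tokens under the current policy;
# old_logprobs: log-probs under the policy that generated the samples;
# advantages: estimated advantages (from a critic/value model in PPO).
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) of the two, then average; negate for a loss
    return -torch.min(unclipped, clipped).mean()
```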

Qwen1.5 (2024-02)

Introducing Qwen1.5
No dedicated technical report was provided, but the blog mentioned that the RL phase used DPO and PPO for alignment.

Qwen2 (2024-07)

Qwen2 Technical Report
Overall, DPO was adopted, divided into offline and online phases. In the offline phase, preference datasets were directly used with DPO for alignment; in the online training phase, the model continuously optimizes its performance through real-time feedback. The specific approach is to sample multiple responses from the current policy model, and then the reward model selects the most preferred and least preferred responses to form preference pairs, which are used for DPO in each training cycle. Although PPO was not directly used, the reward model was still trained to select DPO preference pairs.
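A minimal sketch of this online pair-construction step, assuming hypothetical `policy.generate` and `reward_model.score` interfaces:

```python
# Sketch of online DPO pair construction as described for Qwen2:
# sample several responses per prompt, score them with the reward model,
# and keep the highest- and lowest-scored responses as a preference pair.
# `policy` and `reward_model` are hypothetical objects, not Qwen's actual API.

def build_online_dpo_pairs(prompts, policy, reward_model, num_samples=8):
    pairs = []
    for prompt in prompts:
        responses = [policy.generate(prompt) for _ in range(num_samples)]
        scores = [reward_model.score(prompt, r) for r in responses]
        best = responses[scores.index(max(scores))]
        worst = responses[scores.index(min(scores))]
        pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs  # fed into DPO training for this cycle
```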

Qwen2.5 (Model released in 2024-09, Technical report released in 2024-12)

Qwen2.5 Technical Report
Qwen2.5 still adopts a two-phase approach. The offline phase uses DPO, leveraging execution feedback and answer matching to ensure the quality of generated responses, which is especially suitable for tasks that have standard answers but are otherwise hard to evaluate, such as mathematics and code generation. The online phase uses GRPO, which improves the accuracy, coherence, and human-preference alignment of the model's responses through RM feedback.

Qwen2.5-Coder (2024-09)

Qwen2.5-Coder Technical Report
Qwen2.5-Coder aligns through offline DPO. For simple code, a multi-language code sandbox is used to generate test cases to validate correctness; for complex code, the LLM-as-judge method is used to evaluate code quality. Ultimately, the code DPO data is combined with general data for offline DPO training.
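As a rough illustration of the execution-feedback idea (the actual Qwen2.5-Coder sandbox is multi-language and far more elaborate), the snippet below runs a candidate Python solution against generated test cases in a subprocess and uses "all tests pass" as the preference signal.

```python
# Rough sketch of execution-feedback labeling for code DPO data:
# run a candidate Python solution plus its test cases in a subprocess
# and treat "all tests pass" as the preference signal. This is a toy
# illustration, not the Qwen2.5-Coder sandbox.
import subprocess
import sys

def passes_tests(solution_code: str, test_code: str, timeout: float = 5.0) -> bool:
    program = solution_code + "\n\n" + test_code  # tests assert on the solution
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```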

03

LLaMA Series

LLaMA (2023-02)

LLaMA: Open and Efficient Foundation Language Models
No RL phase was involved; only instruction fine-tuning was performed.

LLaMA-2 (2023-06)

Llama 2: Open Foundation and Fine-Tuned Chat Models

Rejection Sampling and PPO are combined for iterative optimization, with the Reward Model consisting of two models (one for Safety, the other for Helpfulness). In each iteration, the model generates multiple responses, the reward model selects the highest-scoring response as the new target, and the model is then fine-tuned on it. Performance improves gradually through repeated sampling and selection, and PPO is applied on top of rejection sampling for further optimization.
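A simplified sketch of one rejection-sampling round with separate Helpfulness and Safety reward models; the score-combination rule and the generate/score interfaces below are illustrative assumptions rather than the exact Llama 2 recipe.

```python
# Simplified sketch of one rejection-sampling round with two reward models
# (helpfulness + safety), loosely following the Llama 2 recipe. The score
# combination and the generate/score interfaces are illustrative assumptions.

def rejection_sample(prompt, policy, helpful_rm, safety_rm,
                     num_samples=8, safety_weight=0.5):
    candidates = [policy.generate(prompt) for _ in range(num_samples)]

    def combined_score(response):
        return (helpful_rm.score(prompt, response)
                + safety_weight * safety_rm.score(prompt, response))

    # Keep the highest-scoring candidate as the new fine-tuning target
    best = max(candidates, key=combined_score)
    return {"prompt": prompt, "response": best}
```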

LLaMA-3 & LLaMA-3.1 (2024-06)

The Llama 3 Herd of Models

Overall, the approach is similar to LLaMA-2, adopting an iterative strategy for enhancement (the paper mentions six rounds of iteration). The Reward Model training differs from LLaMA-2 in that the margin term is removed from the loss function, and DPO is used for preference optimization instead of LLaMA-2's PPO.
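To make the difference concrete, the sketch below contrasts a pairwise ranking loss with a margin term (as in LLaMA-2, where the margin reflects how strongly annotators preferred one response) against the plain loss after the margin is removed; the margin tensor here is illustrative.

```python
# Pairwise reward-model ranking loss with and without a margin term.
# LLaMA-2 added a preference-strength margin m to the loss; LLaMA-3 removed it.
# `margin` here is an illustrative tensor, not the paper's actual values.
import torch
import torch.nn.functional as F

def rm_loss_with_margin(chosen_scores, rejected_scores, margin):
    # LLaMA-2-style: require chosen to beat rejected by at least `margin`
    return -F.logsigmoid(chosen_scores - rejected_scores - margin).mean()

def rm_loss_plain(chosen_scores, rejected_scores):
    # LLaMA-3-style: margin term removed
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```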

END
