
Author: Chen Zhiyan
This article is about 2400 words long and is recommended for an 8-minute read.
This article introduces reinforcement learning in ChatGPT.
ChatGPT is based on OpenAI's GPT-3.5 series and is a sibling model to InstructGPT. It introduces a method of incorporating human feedback into the training process so that the model's outputs better align with user intent. OpenAI's 2022 paper "Training Language Models to Follow Instructions with Human Feedback" describes reinforcement learning from human feedback (RLHF) in depth.
The creators combined supervised learning and reinforcement learning to fine-tune ChatGPT, with the reinforcement learning component being its unique feature. Researchers used a special technique called “Reinforcement Learning from Human Feedback (RLHF)” to minimize harmful, untruthful, and/or biased outputs using human feedback in the training loop.
This method consists of the following three steps:
Step 1: Supervised fine-tuning. The pre-trained language model is fine-tuned on a relatively small set of demonstration data curated by annotators to learn a supervised policy (the SFT model) that generates outputs for a selected list of prompts; this serves as the baseline model;
Step 2: "Imitating human preferences". Annotators are asked to vote on a relatively large number of SFT model outputs, creating a new dataset of comparison data. A reward model (RM) is trained on this dataset;
Step 3: Proximal Policy Optimization (PPO). The reward model is used to further fine-tune and improve the SFT model. The result of this step is what is known as the policy model.
Step 1 is performed only once, while Steps 2 and 3 can be iteratively repeated: collecting more comparative data on the current best policy model, training a new reward model, and then training a new policy based on that.
Supervised Fine-Tuning (SFT) Model
The first step is to collect demonstration data to train a supervised policy model called the SFT model.
Data Collection: A list of prompts is selected, and a group of human annotators is asked to write the expected output responses. ChatGPT uses two different sources of prompts: some come directly from annotators or developers, while others are sampled from OpenAI's API requests (i.e., from GPT-3 customers). The entire process is slow and costly, resulting in a relatively small, high-quality curated dataset (approximately 12-15k data points) that is used to fine-tune the pre-trained language model.
Model Selection: The developers chose a pre-trained model from the GPT-3.5 series rather than fine-tuning the original GPT-3 model. The baseline model used is the latest text-davinci-003, a GPT-3 model that was fine-tuned mainly on programming code.
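To make this step concrete, here is a minimal sketch of supervised fine-tuning on demonstration data, assuming the Hugging Face transformers library; the "gpt2" checkpoint and the single in-line demonstration are placeholders for the GPT-3.5 base model and the curated dataset described above, not what OpenAI actually used.

```python
# Minimal supervised fine-tuning sketch (Step 1), assuming the Hugging Face
# `transformers` library. The "gpt2" checkpoint and the single demonstration
# below are placeholders for the GPT-3.5 base model and the ~12-15k curated
# prompt/response pairs described above.
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=1e-5)

# Each demonstration is a (prompt, human-written response) pair.
demonstrations = [
    ("Explain the moon landing to a six-year-old.",
     "Some people went to the moon in a big rocket and walked around on it."),
]

model.train()
for prompt, response in demonstrations:
    # Train the model to predict every next token of "prompt + response"
    # (a simplified variant that does not mask out the prompt tokens).
    batch = tokenizer(prompt + " " + response, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])  # standard causal LM loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```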
Due to the limited amount of data available at this step, the resulting SFT model is still likely to generate text that is not attuned to what users actually care about, and it often suffers from misalignment. The problem is that scaling up this supervised-learning step is prohibitively costly.
To overcome these issues, instead of having annotators create a much larger curated dataset (a slow and costly process), the strategy is to have them rank the SFT model's outputs and use those rankings to train a reward model, as explained in more detail below.
Reward Model
The SFT model trained in Step 1 already generates better-aligned responses to user prompts. The next step is to train a reward model, whose input is a sequence of prompts and responses and whose output is a scalar value known as the reward. The reward model is needed in order to apply reinforcement learning, in which a model learns to produce outputs that maximize its reward.
An objective function (the reward model) is learned directly from the data. Its purpose is to assign a score to the SFT model's outputs that is proportional to how desirable those outputs are to humans. In practice, this reflects the specific preferences of the selected group of annotators and the criteria they agree to follow. Ultimately, the process distills from the data an automated system that mimics human preferences. It works as follows:
- A list of prompts is selected, and the SFT model generates multiple outputs (between 4 and 9) for each prompt;
- Annotators rank the outputs from best to worst, producing a new labeled dataset in which the rankings are the labels; this dataset is approximately 10 times the size of the curated dataset used for the SFT model;
- A reward model (RM) is trained on this new data. The RM takes SFT model outputs as input and scores them according to human preference.
For annotators, ranking outputs is much easier than writing them from scratch, so this process scales far more efficiently. In practice, the dataset is built from 30-40k prompts; during the ranking phase, annotators are shown a selection of the outputs generated for each prompt and order them from best to worst, which yields many ranked output combinations.
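In the InstructGPT paper, each ranking of K outputs is turned into pairwise comparisons, and the reward model is trained so that the higher-ranked ("chosen") output receives a larger scalar reward than the lower-ranked ("rejected") one. The snippet below is a minimal sketch of that pairwise ranking loss; the reward values in the usage example are made up, standing in for the scores an actual reward network would produce.

```python
# Sketch of the pairwise ranking loss used to train the reward model (Step 2).
# For each comparison derived from an annotator ranking, the loss pushes the
# reward of the preferred output above the reward of the rejected one:
#   loss = -log(sigmoid(r(prompt, chosen) - r(prompt, rejected)))
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_rewards: torch.Tensor,
                          rejected_rewards: torch.Tensor) -> torch.Tensor:
    # chosen_rewards / rejected_rewards: shape (batch,), one scalar per output.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar rewards a reward model might assign to two ranked pairs.
chosen = torch.tensor([1.3, 0.2])
rejected = torch.tensor([0.4, 0.9])
print(pairwise_ranking_loss(chosen, rejected))  # smaller when chosen > rejected
```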
Fine-Tuning the SFT Model Using the Proximal Policy Optimization (PPO) Algorithm
Next, reinforcement learning is used to fine-tune the SFT policy so that it optimizes the reward model. The model receives a random prompt and returns a response, generated using the model's current policy. A policy is the strategy the machine has learned for achieving its goal, which in this case is maximizing its reward. Based on the reward model developed in Step 2, a scalar reward value is determined for each prompt-response pair. The reward is then fed back to improve the policy. The algorithm used is Proximal Policy Optimization (PPO), and the fine-tuned model is called the PPO model.
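As a rough illustration of this loop, the sketch below samples a response from the current policy and scores it, again assuming the Hugging Face transformers library; the "gpt2" checkpoint stands in for the SFT-initialized policy, and reward_model_score is a stub for the reward model from Step 2. The PPO update itself is sketched after the list of key points below.

```python
# Sketch of a single rollout in Step 3: sample a response from the current
# policy and score it with the reward model. "gpt2" stands in for the SFT
# model, and reward_model_score is a stub for the reward model from Step 2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")  # initialized from the SFT model

def reward_model_score(prompt: str, response: str) -> float:
    # Placeholder: the real reward model returns a scalar learned from rankings.
    return 0.0

prompt = "Explain reinforcement learning in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = policy.generate(**inputs, max_new_tokens=40, do_sample=True,
                                 pad_token_id=tokenizer.eos_token_id)
# Keep only the newly generated tokens (the response), not the prompt.
response = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
reward = reward_model_score(prompt, response)
# This reward (plus the per-token KL penalty described below) drives the PPO update.
```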
In 2017, Schulman et al. introduced Proximal Policy Optimization (PPO), which is used here to update the policy as each response is generated. PPO includes a per-token Kullback-Leibler (KL) penalty relative to the SFT model. KL divergence measures the similarity of two distributions and penalizes them for being too far apart. Here, the KL penalty keeps the responses close to the outputs of the SFT model trained in Step 1, so that the policy does not over-optimize the reward model and drift too far from the human-intent dataset. The key points of this method are:
- PPO is an "on-policy" algorithm for training agents by policy optimization: unlike "off-policy" algorithms such as DQN (Deep Q-Network), which learn from stored past experiences, PPO learns from and updates the current policy directly, continuously adjusting it based on the actions the agent takes and the rewards it receives;
- PPO uses a trust-region optimization method to train the policy, limiting each change to within a certain distance of the previous policy to ensure stability (a minimal sketch of the clipped objective that enforces this appears after this list). This contrasts with other policy-gradient methods, which sometimes make large updates to the policy and destabilize learning;
- PPO uses a value function to estimate the expected return of a given state or action. The value function is used to compute the advantage function, which represents the difference between the return actually obtained and the return the value function expected. By comparing the action taken under the current policy with the action the previous policy would have taken, PPO can update the policy more intelligently based on the estimated value of each action.
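The clipped surrogate objective below is the mechanism PPO uses to enforce the trust-region idea mentioned above: the probability ratio between the new policy and the policy that generated the data is clipped so that a single update cannot move the policy too far. This is a minimal sketch; the tensors in the usage example are illustrative values only.

```python
# Minimal sketch of PPO's clipped surrogate policy loss (Schulman et al., 2017).
import torch

def ppo_policy_loss(logprobs_new: torch.Tensor,
                    logprobs_old: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the current policy and the policy that
    # generated the data.
    ratio = torch.exp(logprobs_new - logprobs_old)
    # Unclipped and clipped surrogate terms; PPO takes the pessimistic minimum.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with per-token log-probabilities and advantage estimates.
new_lp = torch.tensor([-1.0, -0.5, -2.0])
old_lp = torch.tensor([-1.1, -0.7, -1.8])
adv = torch.tensor([0.5, -0.2, 1.0])
print(ppo_policy_loss(new_lp, old_lp, adv))
```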
In this step, the PPO model is initialized from the SFT model, and the value function is initialized from the reward model. The environment is a bandit environment: it presents a random prompt and expects a response to that prompt. Given the prompt and the response, a reward is produced (determined by the reward model) and the episode ends. A per-token KL penalty with respect to the SFT model is added to the reward to avoid over-optimizing the reward model.
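A common way to implement this shaped reward in open-source RLHF code is to subtract a per-token KL estimate from the reward and add the reward-model score at the final token of the response. The sketch below follows that convention; the coefficient beta and the log-probabilities are illustrative values, not the settings OpenAI used.

```python
# Sketch of the KL-shaped reward for one response: a per-token penalty for
# drifting from the SFT model, plus the reward-model score at the last token.
import torch

def shaped_rewards(rm_score: float,
                   policy_logprobs: torch.Tensor,
                   sft_logprobs: torch.Tensor,
                   beta: float = 0.02) -> torch.Tensor:
    # Per-token KL estimate between the PPO policy and the frozen SFT model.
    kl_per_token = policy_logprobs - sft_logprobs
    rewards = -beta * kl_per_token           # KL penalty applied at every token
    rewards[-1] = rewards[-1] + rm_score     # RM score added at the final token
    return rewards

# Toy usage: log-probabilities of the sampled response tokens under each model.
policy_lp = torch.tensor([-0.8, -1.2, -0.3])
sft_lp = torch.tensor([-1.0, -1.0, -0.5])
print(shaped_rewards(rm_score=0.7, policy_logprobs=policy_lp, sft_logprobs=sft_lp))
```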
Conclusion
ChatGPT introduces reinforcement learning with the Proximal Policy Optimization (PPO) algorithm to fine-tune the SFT model, incorporating human feedback into the training process and significantly improving how well the trained model's outputs align with user intent.
Introduction to Datapi Research Department
Founded in early 2017, the Datapi Research Department is divided into several interest-based groups. Each group follows the research department's overall knowledge sharing and project planning while keeping its own character:
Algorithm Model Group: Actively participates in competitions like Kaggle and creates original hands-on articles;
Research and Analysis Group: Investigates the applications of big data through interviews and explores the beauty of data products;
System Platform Group: Tracks the cutting-edge technology of big data & AI system platforms and engages with experts;
Natural Language Processing Group: Focuses on practice, actively participates in competitions, and plans various text analysis projects;
Manufacturing Big Data Group: Upholds the dream of a strong industrial nation, combining industry, academia, research, and government to extract data value;
Data Visualization Group: Merges information with art, explores the beauty of data, and learns to tell stories using visualization;
Web Crawling Group: Crawls web information and collaborates with other groups to develop creative projects.