Detailed Explanation of ChatGPT and InstructGPT

Source: JD Cloud Dolphin Data Science Lab

This article is approximately 7,000 words long; suggested reading time is 15 minutes.
To understand ChatGPT, we must first understand InstructGPT.

Introduction

The GPT series is a series of pre-trained models from OpenAI, where GPT stands for Generative Pre-Trained Transformer. As the name suggests, the purpose of GPT is to obtain a universal text model using pre-training techniques based on the Transformer architecture. So far, the publicly announced models include the text pre-trained GPT-1, GPT-2, and GPT-3, as well as the image pre-trained iGPT. It is rumored that the yet-to-be-released GPT-4 is a multimodal model. The recently popular ChatGPT and InstructGPT, which was announced at the beginning of this year, are sister models released ahead of GPT-4 and are sometimes referred to as GPT-3.5. ChatGPT and InstructGPT share the same model structure and training method, both using Instruction Learning and Reinforcement Learning from Human Feedback (RLHF) to guide model training. The only difference lies in how the data is collected. Therefore, to understand ChatGPT, we must first understand InstructGPT.

1. Background Knowledge

Before introducing ChatGPT/InstructGPT, let’s first discuss the foundational algorithms they rely on.

1.1 GPT Series

The three generations of models based on text pre-training, GPT-1, GPT-2, and GPT-3, all use a Transformer-based architecture (Figure 1). The differences lie in the number of layers, the length of word vectors, and other hyperparameters, as detailed in Table 1.

Figure 1: Model structure of the GPT series (where Trm is a Transformer structure)
Table 1: Release dates, parameter counts, and training volumes of the GPT generations

| Model | Release Date | Layers | Heads | Word Vector Length | Parameter Count | Pre-training Data Volume |
| ----- | ------------- | ------ | ----- | ------------------ | --------------- | ------------------------ |
| GPT-1 | June 2018     | 12     | 12    | 768                | 117 million     | About 5 GB               |
| GPT-2 | February 2019 | 48     | -     | 1600               | 1.5 billion     | 40 GB                    |
| GPT-3 | May 2020      | 96     | 96    | 12288              | 175 billion     | 45 TB                    |

GPT-1 was released a few months before BERT. Both use Transformer as the core structure, but GPT-1 was trained using a left-to-right generative pre-training task to obtain a universal pre-trained model, which, like BERT, can be fine-tuned for downstream tasks. At the time, GPT-1 achieved SOTA results on nine NLP tasks, but its model scale and data volume were relatively small, which led to the development of GPT-2.

Compared to GPT-1, GPT-2 did not significantly change the model structure; it simply used a larger model with more parameters and more training data (Table 1). The key idea of GPT-2 is the proposition that every supervised task is a subset of unsupervised language modeling, which is also a precursor to prompt learning. GPT-2 caused quite a stir upon its release, as the news articles it generated were often convincing enough to deceive most readers, earning it the label of “the most dangerous weapon in AI.” Many major websites even banned the use of news generated by GPT-2.

When GPT-3 was introduced, what sparked more discussion than its superior performance over GPT-2 was its staggering parameter count of 175 billion. In addition to performing common NLP tasks, researchers unexpectedly found that GPT-3 also performed well at generating code in SQL, JavaScript, and other languages, as well as at simple mathematical operations. GPT-3's training relied on in-context learning, a form of meta-learning whose core idea is to use a small amount of data to find a suitable initialization range, allowing the model to fit quickly on a limited dataset and achieve good results.
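To make in-context learning more concrete, the toy sketch below assembles a hypothetical few-shot prompt: a handful of demonstrations are placed directly in the input, and the model is expected to continue the pattern without any parameter update. The task, examples, and template are invented for illustration and are not taken from OpenAI's materials.

```python
# A minimal sketch of few-shot in-context learning: the "training signal"
# lives entirely inside the prompt; no gradient update takes place.
# The demonstrations below are invented for illustration.
few_shot_examples = [
    ("The movie was a waste of time.", "Negative"),
    ("An absolute masterpiece, I cried at the end.", "Positive"),
    ("Great soundtrack, but the plot dragged.", "Mixed"),
]

def build_few_shot_prompt(query: str) -> str:
    """Concatenate the demonstrations and the new query into one prompt."""
    lines = ["Classify the sentiment of each review."]
    for review, label in few_shot_examples:
        lines.append(f"Review: {review}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")  # the model continues from here
    return "\n\n".join(lines)

print(build_few_shot_prompt("I bought this necklace for my girlfriend, she loves it."))
```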

From the analysis above, we can see that from a performance perspective, GPT has two goals:

  1. Improve the model’s performance on common NLP tasks;

  2. Enhance the model’s generalization ability on other atypical NLP tasks (such as code writing and mathematical operations).

Additionally, since the inception of pre-trained models, a widely criticized issue has been their bias. Because pre-trained models are trained on vast amounts of data with huge parameter counts, and unlike expert systems that are fully controlled by human rules, a pre-trained model behaves like a black box. No one can guarantee that a pre-trained model will not generate content containing racial or gender discrimination, because its tens of GB or even TB of training data almost certainly contain similar samples. This is the motivation behind InstructGPT and ChatGPT, whose optimization goals the paper summarizes as the 3H criteria:

  • Helpful;

  • Honest;

  • Harmless.

OpenAI’s GPT series models are not open-source, but they provide a trial website for the models, which interested individuals can use under certain conditions.

1.2 Instruction Learning and Prompt Learning

Instruction learning was proposed in a 2021 paper from Quoc V. Le's team at Google, “Finetuned Language Models Are Zero-Shot Learners.” Both instruction learning and prompt learning aim to tap into the knowledge already contained in the language model. The difference is that prompts stimulate the model's completion ability, for example generating the second half of a sentence or filling in a blank, while instructions stimulate the model's understanding ability by giving it a clearer directive so that it takes the correct action. We can understand the two learning methods through the following examples:

  1. Prompt Learning: I bought this necklace for my girlfriend, she loves it, this necklace is too ____.

  2. Instruction Learning: Determine the sentiment of this sentence: I bought this necklace for my girlfriend, she loves it. Options: A=Good; B=Average; C=Bad.

The advantage of instruction learning is that it can also perform zero-shot learning on other tasks after multi-task fine-tuning, while prompt learning is task-specific and has less generalization capability. We can understand fine-tuning, prompt learning, and instruction learning through Figure 2.


Figure 2: Comparison of Model Fine-tuning, Prompt Learning, and Instruction Learning
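To see the difference between the two formats in code, the toy sketch below wraps the same review from the examples above in a cloze-style prompt and in an instruction-style input. The templates are invented for illustration; real prompt engineering is of course richer than this.

```python
review = "I bought this necklace for my girlfriend, she loves it."

# Prompt learning: a cloze-style template that relies on the model's
# completion ability; the model is expected to fill in the blank.
prompt_style = f"{review} This necklace is so ____."

# Instruction learning: an explicit directive plus options that rely on the
# model's understanding ability.
instruction_style = (
    "Determine the sentiment of this sentence: "
    f"{review} Options: A=Good; B=Average; C=Bad."
)

print(prompt_style)
print(instruction_style)
```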

1.3 Reinforcement Learning from Human Feedback

Since a trained model is not very controllable, it can be seen as a fit to the distribution of its training set. When this is reflected back into generation, the distribution of the training data becomes the most important factor affecting the quality of the generated content. Sometimes we want the model to be influenced not only by the training data but also to be controllable by humans, ensuring the usefulness, authenticity, and harmlessness of the generated content. The paper repeatedly mentions the alignment problem, which we can understand as aligning the model's output with what humans prefer. Human preference includes not only the fluency and grammatical correctness of the generated content but also its usefulness, authenticity, and harmlessness.

We know that reinforcement learning guides model training through a reward mechanism, and the reward can be seen as playing the role of the loss function in traditional training. Computing a reward is more flexible and diverse than computing a loss (for example, AlphaGO's reward is the outcome of the game), but the trade-off is that the reward is non-differentiable and therefore cannot be used directly for backpropagation. The idea of reinforcement learning is to approximate this objective by sampling rewards extensively and thereby train the model. Similarly, human feedback is non-differentiable, so it too can be used as a reward for reinforcement learning, which gives rise to reinforcement learning from human feedback (RLHF).

RLHF can be traced back to the 2017 paper “Deep Reinforcement Learning from Human Preferences” from OpenAI and DeepMind, which improved the performance of reinforcement learning on simulated robotics tasks and Atari games by using human preference annotations as feedback.


Figure 3: Basic Principle of Reinforcement Learning from Human Feedback

In InstructGPT/ChatGPT, a classic reinforcement learning algorithm is also used: Proximal Policy Optimization (PPO), proposed by OpenAI. PPO is a Policy Gradient method. Vanilla Policy Gradient algorithms are sensitive to the step size, yet an appropriate step size is difficult to choose: if the new policy changes too much from the old one during training, learning is hindered. PPO introduces a new objective function that allows small-batch updates over multiple training epochs, addressing the step-size problem of Policy Gradient algorithms. TRPO was designed to address the same issue, but compared with TRPO, PPO is simpler to solve and implement.
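The heart of PPO is a clipped surrogate objective that limits how far the updated policy can move away from the old policy in a single step. The NumPy sketch below is a generic illustration of that objective under the usual textbook formulation; the variable values are made up, and it is not tied to InstructGPT's actual implementation.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Generic PPO clipped surrogate objective (a quantity to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the old policy; advantages: estimated advantage values.
    """
    ratio = np.exp(logp_new - logp_old)                 # probability ratio r_t
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))      # pessimistic (lower) bound

# Toy usage with made-up numbers:
objective = ppo_clipped_objective(
    logp_new=np.array([-0.9, -1.2, -0.3]),
    logp_old=np.array([-1.0, -1.0, -0.5]),
    advantages=np.array([0.5, -0.2, 1.0]),
)
print(objective)
```

Because the ratio is clipped to the interval [1 − ε, 1 + ε], a single update cannot push the new policy too far from the old one, which is exactly the step-size problem described above.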

2. Understanding the Principles of InstructGPT/ChatGPT

With the foundational knowledge above, understanding InstructGPT and ChatGPT becomes much simpler. In simple terms, InstructGPT/ChatGPT both adopt the GPT-3 network structure, construct training samples through instruction learning to train a reward model (RM) that predicts the quality of generated content, and then use the RM's scores to guide the training of the reinforcement learning model. The training process of InstructGPT/ChatGPT is illustrated in Figure 4.


Figure 4: Computational Process of InstructGPT: (1) Supervised Fine-tuning (SFT); (2) Reward Model (RM) training; (3) Reinforcement Learning based on the reward model using PPO.

From Figure 4, we can see that the training of InstructGPT/ChatGPT can be divided into three steps, where the reward model of step 2 and the reinforcement-learning-trained SFT model of step 3 can be optimized iteratively.

  1. Supervised fine-tuning (SFT) of GPT-3 based on the collected SFT dataset;

  2. Collecting human-annotated comparison data to train the reward model (Reward Model, RM);

  3. Using RM as the optimization objective for reinforcement learning, fine-tuning the SFT model using the PPO algorithm.

Based on Figure 4, we will introduce the data collection and model training aspects of InstructGPT/ChatGPT.

2.1 Data Collection

As shown in Figure 4, the training of InstructGPT/ChatGPT is divided into three steps, each requiring slightly different data, which we will introduce below.

2.1.1 SFT Dataset

The SFT dataset is used in the first step to train the supervised model, i.e., to fine-tune GPT-3 on newly collected data using GPT-3's own training method. Since GPT-3 is a generative model based on prompts, the SFT dataset consists of samples made up of prompt-response pairs. Part of the SFT data comes from users of OpenAI's Playground, while the other part comes from the 40 labelers OpenAI hired and trained for this task. For this dataset, the labelers' job is to write prompts themselves and to ensure that the prompts cover the following three categories (a toy example of an SFT sample is sketched after the list below):

  • Simple Tasks: Labelers provide any simple task while ensuring task diversity;

  • Few-shot Tasks: Labelers provide one instruction along with multiple query-response pairs for that instruction;

  • User-Relevant: Labelers obtain use cases from the interface and create instructions based on these use cases.
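For intuition, an SFT training sample is simply a prompt paired with a demonstration response written by a labeler. The toy example below is invented for illustration and does not come from the actual dataset.

```python
# A hypothetical SFT sample: one labeler-written prompt plus a demonstration
# response. Purely illustrative; not taken from OpenAI's data.
sft_sample = {
    "prompt": "Explain the difference between a list and a tuple in Python "
              "to a beginner.",
    "response": (
        "A list is mutable, so you can add, remove, or change its elements "
        "after creating it. A tuple is immutable: once created, its contents "
        "cannot change. Use a tuple for a fixed collection of values and a "
        "list when the collection needs to change."
    ),
}

print(sft_sample["prompt"])
print(sft_sample["response"])
```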

2.1.2 RM Dataset

The RM dataset is used to train the reward model in step 2, where we also need to set a reward objective for InstructGPT/ChatGPT’s training. This reward objective does not need to be differentiable, but it must align as comprehensively and authentically as possible with the content we want the model to generate. Naturally, we can provide this reward through human annotations by giving lower scores to generated content that involves bias, thus encouraging the model not to produce content that humans dislike. The approach taken by InstructGPT/ChatGPT is to first generate a batch of candidate texts and then have labelers rank these generated contents based on quality.
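To see how a single ranking turns into training data for the RM, the sketch below expands a labeler's ranking of K candidate responses into all C(K, 2) (preferred, rejected) pairs. The responses are placeholders.

```python
from itertools import combinations

# Hypothetical candidate responses to one prompt, already ordered by a
# labeler from best to worst (K = 4 here).
ranked_responses = ["response A", "response B", "response C", "response D"]

# Because the list is sorted best-to-worst, every combination (earlier, later)
# is a (preferred, rejected) pair; K = 4 yields C(4, 2) = 6 pairs.
comparison_pairs = [
    {"chosen": better, "rejected": worse}
    for better, worse in combinations(ranked_responses, 2)
]

print(len(comparison_pairs))  # 6
```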

2.1.3 PPO Dataset

The PPO dataset for InstructGPT is unannotated and comes entirely from users of the GPT-3 API. It covers the various task types submitted by different users, with the largest shares being generation tasks (45.6%), QA (12.4%), brainstorming (11.2%), and conversation (8.4%).

2.1.4 Data Analysis

Since InstructGPT/ChatGPT is fine-tuned based on GPT-3 and involves human annotations, the total data volume is not large. Table 2 shows the sources and data volumes of the three datasets.


Table 2: Data Distribution of InstructGPT

Appendix A of the paper discusses the data distribution in more detail; here I will list a few factors that may affect model performance:

  • Over 96% of the data is in English, while the other 20 languages, including Chinese, French, Spanish, etc., together account for less than 4%, which may lead to InstructGPT/ChatGPT being able to generate in other languages but with performance far inferior to English;

  • There are a total of 9 types of prompts, and the vast majority are generation tasks, which may lead to the model missing certain task types;

  • The 40 outsourced employees are concentrated in the US and Southeast Asia, which may lead to a narrow value distribution, potentially generating biases and discrimination issues that are more concerning in other regions.

Additionally, the ChatGPT blog mentions that ChatGPT and InstructGPT share the same training methods, with the only difference being in how they collect data. However, there are no more detailed materials on the differences in data collection. Considering that ChatGPT is only used in the conversational domain, I speculate that ChatGPT has two differences in data collection: 1. Increased proportion of conversational tasks; 2. Conversion of prompts into Q&A formats. Of course, this is merely speculation, and a more accurate description will await the publication of ChatGPT’s papers, source code, and other detailed materials.

2.2 Training Tasks

We just introduced that InstructGPT/ChatGPT has a three-step training process. These three steps will involve three models: SFT, RM, and PPO, which we will detail below.

2.2.1 Supervised Fine-tuning (SFT)

This step of training is consistent with GPT-3, and the authors found that allowing the model to overfit slightly helps the training of the subsequent two steps.

2.2.2 Reward Model (RM)

Since the data for training the RM consists of labelers' rankings of generated results, it can be seen as a regression model. The RM's structure is the SFT model with its final unembedding layer removed; its input is a prompt and a response, and its output is a scalar reward value. Specifically, for each prompt, InstructGPT/ChatGPT randomly generates $K$ outputs ($4 \leq K \leq 9$) and presents these outputs to each labeler in pairs, i.e. $C_K^2$ pairs per prompt, from which the labeler chooses the better output. During training, InstructGPT/ChatGPT treats the $C_K^2$ response pairs of each prompt as a single batch, which is less prone to overfitting than traditional sample-wise batching, because each prompt is fed into the model only once.

The loss function of the reward model is shown in equation (1). Its objective is to maximize the difference in reward between the responses labelers prefer and those they do not.

$$\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}} \, E_{(x,\,y_w,\,y_l)\sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big] \tag{1}$$

where $r_\theta(x, y)$ is the reward of prompt $x$ and response $y$ under parameters $\theta$, $y_w$ is the response preferred by the labeler, $y_l$ is the response not preferred, and $D$ is the entire training dataset.
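The PyTorch-style sketch below shows the structure of the pairwise loss in equation (1) for a single comparison, assuming a hypothetical reward_model callable that returns a scalar reward for a (prompt, response) pair. It is only a structural illustration, not OpenAI's actual training code; in practice all $C_K^2$ pairs of one prompt are processed together as a batch and the loss is averaged over them.

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_model, prompt, chosen, rejected):
    """Pairwise ranking loss of equation (1) for one (chosen, rejected) pair.

    reward_model is assumed to map a (prompt, response) pair to a scalar
    reward r_theta(x, y); in practice the C(K, 2) pairs of one prompt form
    a single batch and the per-pair losses are averaged.
    """
    r_chosen = reward_model(prompt, chosen)      # r_theta(x, y_w)
    r_rejected = reward_model(prompt, rejected)  # r_theta(x, y_l)
    # -log(sigmoid(r_w - r_l)): pushes the preferred reward above the other.
    return -F.logsigmoid(r_chosen - r_rejected)

# Toy usage with a stand-in "reward model" that simply scores response length.
toy_rm = lambda prompt, response: torch.tensor(float(len(response)))
loss = rm_pairwise_loss(toy_rm, "some prompt", "a longer, preferred answer", "short one")
print(loss.item())
```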

2.2.3 Reinforcement Learning Model (PPO)

Reinforcement learning and pre-trained models have been two of the hottest AI directions in recent years. Previously, many researchers argued that reinforcement learning is not very suitable for application in pre-trained models because it is difficult to establish a reward mechanism based on the model’s output. However, InstructGPT/ChatGPT counterintuitively achieved this by integrating human annotations, marking the biggest innovation in this algorithm.

As shown in Table 2, the training dataset for PPO comes entirely from the API. It continues training the SFT model using the reward model obtained in step 2. Many times, reinforcement learning is very difficult to train, and InstructGPT/ChatGPT encountered two issues during training:

  • Issue 1: As the model updates, the data generated by the reinforcement learning model diverges more and more from the training data of the reward model. The authors’ solution was to add a KL penalty term $\beta \log\bigl(\pi_\phi^{\mathrm{RL}}(y\mid x)/\pi^{\mathrm{SFT}}(y\mid x)\bigr)$ to the training objective to ensure that the outputs of the PPO model do not diverge too far from the SFT outputs.

  • Issue 2: Training solely with the PPO model leads to a significant drop in performance on general NLP tasks. The authors’ solution was to include a general language-modeling objective $\gamma E_{x\sim D_{\text{pretrain}}}\bigl[\log\bigl(\pi_\phi^{\mathrm{RL}}(x)\bigr)\bigr]$ in the training target, which is referred to as PPO-ptx in the paper.

In summary, the training objective for PPO is expressed as equation (2):

$$\mathrm{objective}(\phi) = E_{(x,y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\Big[r_\theta(x, y) - \beta \log\bigl(\pi_\phi^{\mathrm{RL}}(y\mid x)/\pi^{\mathrm{SFT}}(y\mid x)\bigr)\Big] + \gamma E_{x\sim D_{\text{pretrain}}}\Big[\log\bigl(\pi_\phi^{\mathrm{RL}}(x)\bigr)\Big] \tag{2}$$
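To make the structure of equation (2) explicit, the sketch below evaluates the PPO-ptx objective for a single sample, assuming the log-probabilities and the reward have already been computed elsewhere. All names and numbers are placeholders; in the paper the two expectations are taken over different datasets (RL rollouts versus pretraining data), which is simplified away here.

```python
def ppo_ptx_objective(reward, logp_rl, logp_sft, logp_pretrain, beta, gamma):
    """Per-sample PPO-ptx objective of equation (2), a quantity to be maximized.

    reward:        r_theta(x, y) from the reward model
    logp_rl:       log pi_phi^RL(y | x) under the policy being trained
    logp_sft:      log pi^SFT(y | x) under the frozen SFT model
    logp_pretrain: log pi_phi^RL(x) on a sample from the pretraining distribution
    beta, gamma:   KL-penalty and pretraining-mix coefficients
    """
    kl_penalty = beta * (logp_rl - logp_sft)   # beta * log(pi_RL(y|x) / pi_SFT(y|x))
    ptx_term = gamma * logp_pretrain           # general language-modeling objective
    return reward - kl_penalty + ptx_term

# Toy numbers, purely for illustration:
print(ppo_ptx_objective(reward=1.3, logp_rl=-12.0, logp_sft=-12.5,
                        logp_pretrain=-80.0, beta=0.02, gamma=0.001))
```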

3. Performance Analysis of InstructGPT/ChatGPT

It is undeniable that the performance of InstructGPT/ChatGPT is impressive, especially after incorporating human annotations, which makes the model's “values” and behavior patterns much better aligned with those of humans. Based solely on the technical solution and training method of InstructGPT/ChatGPT, we can analyze what performance improvements they can bring.

3.1 Advantages

  • The performance of InstructGPT/ChatGPT is more truthful than GPT-3: This is easy to understand, as GPT-3 already possesses strong generalization and generation capabilities. With InstructGPT/ChatGPT introducing labelers to write prompts and rank results, and fine-tuning on top of GPT-3, more truthful data receives higher rewards during reward model training. The authors also compared the models with GPT-3 on the TruthfulQA dataset, and the experiments showed that even the 1.3-billion-parameter PPO-ptx model outperformed GPT-3.

  • InstructGPT/ChatGPT shows some improvement in harmlessness compared to GPT-3: The principle is the same as above. However, the authors found that InstructGPT did not show significant improvement on datasets involving discrimination and bias. This is because GPT-3 is already a very effective model, and the probability of generating problematic samples containing harmful, discriminatory, or biased content is low; the limited data collected and annotated by 40 labelers may not be sufficient to optimize the model further in these aspects, so the improvement is minimal or hard to detect.

  • InstructGPT/ChatGPT has strong coding capabilities: First, GPT-3 already has strong coding capabilities, and the API built on GPT-3 has accumulated a large amount of coding data. In addition, some OpenAI employees participated in data collection. Given the extensive coding-related data and human annotations, it is not surprising that InstructGPT/ChatGPT possesses strong coding capabilities.

3.2 Disadvantages

  • InstructGPT/ChatGPT may reduce performance on general NLP tasks: This was discussed in the PPO training section; while modifying the loss function can alleviate the issue, it has not been completely resolved.

  • Sometimes InstructGPT/ChatGPT produces absurd outputs: Although InstructGPT/ChatGPT uses human feedback, the limited human resources mean that the biggest influence on model performance is still the supervised language-modeling task, with human feedback playing only a corrective role. Therefore, it is likely that the limited correction data, or the misleading guidance of the supervised task (which considers only the model's output, not what humans actually want), causes the generated content to lack authenticity. Like a student, even with a teacher's guidance, it cannot be guaranteed to learn every knowledge point.

  • The model is highly sensitive to instructions: This can also be attributed to the insufficient amount of labeled data, since instructions are the only clue the model has for producing output. If the quantity and variety of instructions are not sufficiently trained, the model may exhibit this issue.

  • The model over-interprets simple concepts: This may be because labelers tend to give higher rewards to longer outputs when comparing generated content.

  • Harmful instructions may lead to harmful responses: For example, InstructGPT/ChatGPT may provide an action plan in response to a user's request for an “AI plan to destroy humanity” (Figure 5). This happens because InstructGPT/ChatGPT assumes that the instructions written by labelers are reasonable and value-aligned; it makes no detailed judgment about the user's input and therefore responds to any instruction. Although the reward model may later assign low reward values to such outputs, during generation the model must balance both its values and how well the output matches the instruction, so it is still possible for it to generate outputs with problematic values.


Figure 5: A plan for the destruction of humanity generated by ChatGPT.

3.3 Future Work

Having analyzed the technical solutions and issues of InstructGPT/ChatGPT, we can also identify potential optimization angles for InstructGPT/ChatGPT.

  • Reducing the cost and increasing the efficiency of human annotation: InstructGPT/ChatGPT employs a team of 40 labelers, but judging from the model's performance, this team is not enough. How to let humans provide more effective feedback, and how to combine human expertise and model capability organically and cleverly, is crucial.

  • The model's ability to generalize over and correct instructions: Since instructions are the only clue the model has for producing output, the model depends on them heavily. Enhancing its ability to generalize over instructions, and to correct erroneous instructions, is an important way to improve the model experience: it would not only broaden the model's range of applications but also make it more “intelligent.”

  • Avoiding a performance decline on general tasks: This may require designing a more reasonable way of using human feedback, or a more advanced model structure. Many of the issues discussed for InstructGPT/ChatGPT could be resolved by providing more labeled data, but that could cause a more severe decline in general NLP task performance, so a solution is needed to balance the 3H of generated results against performance on general NLP tasks.

3.4 Hot Topics Related to InstructGPT/ChatGPT

  • Will the emergence of ChatGPT lead to job losses for junior programmers? From ChatGPT's principles and from the generated examples circulating online, ChatGPT can generate code that often runs correctly. However, a programmer's job is more than writing code; finding the solution to a problem is more important. So ChatGPT will not replace programmers, especially senior ones. Instead, like many of today's code-generation tools, it will become a very useful assistant for writing code.

  • Stack Overflow announced a temporary rule: ChatGPT is banned. ChatGPT is fundamentally a text-generation model. Although it can generate code, it is even better at generating text that merely looks convincing. Code or solutions produced by a text-generation model cannot be guaranteed to run or to solve the problem, yet their plausible appearance can confuse many users looking for answers. To maintain the quality of the forum, Stack Overflow's ban on ChatGPT is a necessary measure.

  • The chatbot ChatGPT generated a “plan to destroy humanity” under prompting; what issues should we pay attention to in AI development? ChatGPT's “plan to destroy humanity” is content force-fitted from massive data under an unanticipated instruction. Although the content looks very realistic and is well expressed, this only demonstrates ChatGPT's strong generation capability; it does not mean that ChatGPT has any intention to destroy humanity. It is merely a text-generation model, not a decision-making model.

4. Conclusion

As with many newly emerged algorithms, ChatGPT has attracted widespread attention in the industry and prompted reflection on AI because its results are useful, truthful, and harmless. However, after examining its algorithmic principles, we find that it is not as terrifying as the industry claims; instead, we can learn many valuable lessons from its technical solution. The most significant contribution of InstructGPT/ChatGPT to the AI field is the clever combination of reinforcement learning and pre-trained models, and, by incorporating human feedback, the improvement of the model's usefulness, truthfulness, and harmlessness. ChatGPT also further raises the cost of large models: previously the competition was only about data volume and model scale, but now it even includes the cost of hired annotators, which is all the more daunting for individual practitioners.

Editor: Yu Tengkai
Proofreader: Yang Xuejun