Introduction
The GPT series is a set of pre-trained language models from OpenAI. GPT stands for Generative Pre-Trained Transformer. As the name suggests, the purpose of GPT is to obtain a general text model through pre-training on top of the Transformer architecture. The published papers so far cover the text pre-trained models GPT-1, GPT-2, and GPT-3, as well as the image pre-trained model iGPT. It is rumored that the yet-to-be-released GPT-4 is a multimodal model. The recently popular ChatGPT and InstructGPT, announced earlier this year, are a pair of sister models released as a warm-up before GPT-4, sometimes referred to as GPT-3.5. ChatGPT and InstructGPT are identical in model structure and training method: both use instruction learning and Reinforcement Learning from Human Feedback (RLHF) to guide training, and the only difference lies in how the data is collected. Therefore, to understand ChatGPT, we must first understand InstructGPT.
1. Background Knowledge
Before introducing ChatGPT/InstructGPT, we will first discuss the foundational algorithms they rely on.
1.1 GPT Series
The three generations of models based on text pre-training, GPT-1, GPT-2, and GPT-3, all adopt a model structure centered around the Transformer (see Figure 1). The differences lie in hyperparameters such as the number of layers and the length of word vectors, detailed in Table 1.
Figure 1: Model structure of the GPT series (where Trm is a Transformer structure)
Table 1: Release dates, parameter counts, and training volumes of the various GPT generations
| Model | Release Date | Layers | Heads | Word Vector Length | Parameter Count | Pre-training Data Volume |
|---|---|---|---|---|---|---|
| GPT-1 | June 2018 | 12 | 12 | 768 | 117 million | ~5GB |
| GPT-2 | February 2019 | 48 | – | 1600 | 1.5 billion | 40GB |
| GPT-3 | May 2020 | 96 | 96 | 12288 | 175 billion | 45TB |
GPT-1 was released a few months before BERT. Both utilize the Transformer as their core structure; however, GPT-1 constructs its pre-training task through a left-to-right generative approach, resulting in a general pre-trained model that, like BERT, can be fine-tuned for downstream tasks. GPT-1 achieved state-of-the-art results on nine NLP tasks, but its model size and data volume were relatively small, leading to the emergence of GPT-2.
Compared to GPT-1, GPT-2 did not significantly change the model structure; it simply used a larger model with more parameters and more training data (see Table 1). The key idea of GPT-2 is that every supervised task can be viewed as a subset of the unsupervised language-modeling task, which also laid the groundwork for prompt learning. GPT-2 caused quite a stir upon its release: the news articles it generated could deceive most human readers, blurring the line between real and fake. It was even called "the most dangerous weapon in AI," and many major websites prohibited the use of GPT-2-generated news.
When GPT-3 was introduced, it not only surpassed GPT-2 in performance; its 175 billion parameters also sparked extensive discussion. Beyond common NLP tasks, researchers unexpectedly found that GPT-3 also excelled at writing code in SQL, JavaScript, and other languages, and at performing simple mathematical operations. GPT-3 relies on in-context learning, a form of meta-learning: a few examples of the target task are placed directly in the prompt, and the model adapts to the task from this small amount of data without any gradient updates, fitting quickly and achieving good results on limited data.
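As a concrete illustration of in-context learning, the sketch below builds a few-shot translation prompt similar to the examples in the GPT-3 paper; the model is simply asked to continue the text, with no parameter updates involved.

```python
# Illustrative few-shot (in-context learning) prompt for a GPT-3-style model.
# The model infers the task purely from the examples embedded in the prompt
# and is expected to continue the text; no fine-tuning takes place.
few_shot_prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

print(few_shot_prompt)
# A completion API would be expected to continue with "fromage",
# having inferred the translation task from the in-context examples.
```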
From the analysis above, we can see that GPT has two performance goals:
- Improve model performance on common NLP tasks;
- Enhance the model's ability to generalize to atypical NLP tasks (e.g., code writing, mathematical operations).
Moreover, since the inception of pre-trained models, a widely criticized issue has been the bias inherent in these models. Pre-trained models are trained on vast amounts of data using models with enormous parameter counts, making them akin to black boxes compared to expert systems that are entirely controlled by human rules. No one can guarantee that a pre-trained model won’t generate content containing racial or gender discrimination, as its training data, which can range from several GBs to several TBs, almost certainly includes such samples. This is the motivation behind the development of InstructGPT and ChatGPT, which are summarized in their paper using the 3H framework:
- Helpful;
- Honest;
- Harmless.
OpenAI’s GPT series models are not open-sourced, but they provide a trial website for the models, which interested users can access under certain conditions.
1.2 Instruction Learning and Prompt Learning
Instruction learning is a concept proposed by Quoc V. Le's team at Google in their 2021 paper "Finetuned Language Models Are Zero-Shot Learners." Both instruction learning and prompt learning aim to tap into the knowledge already present in a language model. The difference is that a prompt stimulates the model's completion ability, such as generating the second half of a sentence from the first half or filling in blanks, whereas an instruction stimulates the model's understanding ability by stating the task explicitly so that the model takes the correct action. We can understand the two learning methods through the following examples (a small formatting sketch follows them):
- Prompt learning: I bought this necklace for my girlfriend, and she loves it; this necklace is so ____.
- Instruction learning: Determine the sentiment of this sentence: "I bought this necklace for my girlfriend, and she loves it." Options: A=Good; B=Average; C=Bad.
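The sketch below simply spells out the two framings as the input strings a model would see; the templates are illustrative rather than taken from any particular paper.

```python
# A minimal sketch contrasting how the same task can be framed as a
# prompt (cloze/completion) versus an explicit instruction.
review = "I bought this necklace for my girlfriend, and she loves it."

# Prompt learning: rely on the model's completion ability.
prompt_style = f"{review} This necklace is so ____."

# Instruction learning: state the task explicitly and ask for a decision.
instruction_style = (
    "Determine the sentiment of this sentence: "
    f"{review}\nOptions: A=Good; B=Average; C=Bad.\nAnswer:"
)

print(prompt_style)
print(instruction_style)
```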
The advantage of instruction learning is that it can perform zero-shot on other tasks after multi-task fine-tuning, while prompt learning is task-specific and lacks the same level of generalization. We can understand the differences between fine-tuning, prompt learning, and instruction learning through Figure 2.
Figure 2: Similarities and differences between model fine-tuning, prompt learning, and instruction learning
1.3 Reinforcement Learning from Human Feedback
The models we train are not entirely controllable; a model can be viewed as a fit to the distribution of its training set. For a generative model, the distribution of the training data is therefore the most crucial factor affecting the quality of the generated content. Sometimes we want the model to be influenced not only by the training data but also to be controllable by humans, so that the generated content is useful, truthful, and harmless. The paper repeatedly mentions the alignment problem, which we can understand as aligning the model's output with human preferences. Human preferences include not only the fluency and grammatical correctness of generated content but also its usefulness, truthfulness, and harmlessness.
We know that reinforcement learning guides model training through a reward mechanism, which plays a role analogous to the loss function in traditional model training. Rewards can be computed far more flexibly than loss functions (for example, AlphaGo's reward is the win/loss outcome of a game), but the cost of this flexibility is that the reward is non-differentiable and cannot be backpropagated directly. The idea of reinforcement learning is to approximate the training signal by sampling rewards extensively, thereby enabling model training. Human feedback is likewise non-differentiable, so it too can be treated as a reinforcement learning reward, which gives rise to reinforcement learning from human feedback (RLHF).
RLHF can be traced back to the 2017 paper "Deep Reinforcement Learning from Human Preferences" by OpenAI and DeepMind, which used human annotations as feedback to improve reinforcement learning performance on simulated robotics tasks and Atari games.
Figure 3: Basic principles of reinforcement learning from human feedback
InstructGPT/ChatGPT also uses a classic reinforcement learning algorithm proposed by OpenAI: Proximal Policy Optimization (PPO). PPO is a policy gradient algorithm. Traditional policy gradient methods are very sensitive to the step size, and an appropriate step size is hard to choose: if the new and old policies differ too much during training, learning suffers. PPO introduces a new objective function that allows small-batch updates over multiple training epochs, which addresses the difficulty of choosing a step size in policy gradient algorithms. TRPO was designed to solve the same problem, but compared to TRPO, PPO is simpler to solve and implement.
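To make the clipping idea concrete, here is a minimal PyTorch sketch of PPO's clipped surrogate objective (Schulman et al., 2017); the tensor names and toy values are illustrative, and a real implementation would also include value-function and entropy terms.

```python
import torch

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized).

    logp_new / logp_old: log-probabilities of the sampled actions under the
    current and old policies; advantages: their advantage estimates.
    """
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the minimum means overly large policy updates are not rewarded,
    # which keeps the new policy close to the old one at each step.
    return torch.min(unclipped, clipped).mean()

# Toy usage with random numbers.
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
advantages = torch.randn(8)
print(ppo_clipped_objective(logp_new, logp_old, advantages))
```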
2. Understanding the Principles of InstructGPT/ChatGPT
With the foundational knowledge above, understanding InstructGPT and ChatGPT becomes much simpler. In brief, both InstructGPT and ChatGPT utilize the GPT-3 architecture, constructing training samples through instruction learning to train a reward model (RM) that predicts the effectiveness of the generated content, ultimately guiding the training of the reinforcement learning model through the scores from this reward model. The training process of InstructGPT/ChatGPT is illustrated in Figure 4.
Figure 4: The computational flow of InstructGPT: (1) Supervised Fine-Tuning (SFT); (2) Training of the Reward Model (RM); (3) Reinforcement learning based on the reward model using PPO.
From Figure 4, we can see that the training of InstructGPT/ChatGPT is divided into three steps, where the reward model and the reinforcement-learning-tuned SFT model can be iteratively optimized (a schematic code outline follows this list):
- Supervised fine-tuning (SFT) of GPT-3 on the collected SFT dataset;
- Collecting human-annotated comparison data and training the reward model (RM);
- Using the RM as the optimization target for reinforcement learning and fine-tuning the SFT model with the PPO algorithm.
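The outline below is a highly schematic Python sketch of how the three steps connect; every function here is a placeholder for an entire training procedure, not a real API.

```python
# Schematic outline of the three-step InstructGPT-style pipeline.
# Function bodies are stubs; they only show how the pieces feed into each other.

def supervised_fine_tune(gpt3, sft_dataset):
    """Step 1: fine-tune GPT-3 on prompt/response demonstrations."""
    return gpt3  # stands in for the resulting SFT model

def train_reward_model(sft_model, comparison_dataset):
    """Step 2: fit a reward model on human rankings of model outputs."""
    return lambda prompt, response: 0.0  # stands in for r_theta(x, y)

def ppo_fine_tune(sft_model, reward_model, prompt_dataset):
    """Step 3: optimize the SFT model against the reward model with PPO."""
    return sft_model  # stands in for the final policy

gpt3 = object()                      # placeholder for a pre-trained GPT-3
sft_model = supervised_fine_tune(gpt3, sft_dataset=[])
reward_model = train_reward_model(sft_model, comparison_dataset=[])
policy = ppo_fine_tune(sft_model, reward_model, prompt_dataset=[])
```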
According to Figure 4, we will introduce the data collection and model training aspects of InstructGPT/ChatGPT separately.
2.1 Data Collection
As shown in Figure 4, the training of InstructGPT/ChatGPT is divided into three steps, each requiring slightly different data, which we will introduce separately.
2.1.1 SFT Dataset
The SFT dataset is used to train the first-step supervised model, i.e., to fine-tune GPT-3 on newly collected data following GPT-3's own training method. Since GPT-3 is a prompt-based generative model, the SFT dataset consists of prompt-response pairs. Part of the SFT data comes from users of OpenAI's Playground, while another part comes from the 40 labelers OpenAI hired and trained for this task. For this dataset, the labelers' job was to write instructions themselves, ensuring that the instructions covered the following three types (an illustrative example follows the list):
- Simple tasks: the labeler writes an arbitrary task, while ensuring task diversity;
- Few-shot tasks: the labeler provides one instruction together with multiple query-response pairs for that instruction;
- User-relevant: the labeler takes use cases from the API interface and writes instructions based on these use cases.
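For concreteness, an SFT training example might look like the following prompt-response pair; the prompt echoes an example from the InstructGPT paper, while the response text is invented here for illustration.

```python
# Illustrative SFT example in prompt/response form. The real dataset consists
# of prompts from the API/Playground plus labeler-written demonstrations.
sft_example = {
    "prompt": "Explain the moon landing to a 6 year old in a few sentences.",
    "response": (
        "People built a big rocket and flew it to the moon. "
        "Two astronauts walked on it, collected some rocks, and came home."
    ),
}
print(sft_example["prompt"])
```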
2.1.2 RM Dataset
The RM dataset is used to train the second step reward model, where we need to set a reward target for the training of InstructGPT/ChatGPT. This reward target does not need to be differentiable but must align as comprehensively and authentically as possible with the content we want the model to generate. Naturally, we can provide this reward through human annotations, giving lower scores to generated content that involves bias, thereby encouraging the model to avoid generating undesirable content. InstructGPT/ChatGPT’s approach is to first have the model generate a batch of candidate texts, which are then ranked by labelers based on the quality of the generated content.
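As a small illustration of how a single ranking yields pairwise training data, the sketch below (with invented responses and a placeholder prompt) enumerates all comparisons from a list ranked best to worst.

```python
from itertools import combinations

# Turn a labeler's ranking of K responses into the C(K, 2) pairwise
# comparisons used to train the reward model.
# `responses` is assumed to be ordered from best to worst by the labeler.
responses = ["answer A", "answer B", "answer C", "answer D"]  # K = 4

pairs = [
    {"prompt": "some prompt", "chosen": better, "rejected": worse}
    for better, worse in combinations(responses, 2)
]
print(len(pairs))  # C(4, 2) = 6 comparison pairs
```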
2.1.3 PPO Dataset
The PPO dataset for InstructGPT is unannotated and comes entirely from user interactions with the GPT-3 API. This includes various types of generation tasks provided by different users, with the highest proportions being generation tasks (45.6%), Q&A (12.4%), brainstorming (11.2%), and dialogue (8.4%).
2.1.4 Data Analysis
Since InstructGPT/ChatGPT are fine-tuned on GPT-3 and involve human annotations, their total data volume is not large. Table 2 shows the sources and data volume of the three datasets.
Table 2: Data distribution of InstructGPT
Appendix A of the paper discusses the data distribution in more detail. Here I highlight several factors that may affect model performance:
- Over 96% of the data is in English; the other 20 languages, including Chinese, French, and Spanish, together account for less than 4%. InstructGPT/ChatGPT can therefore generate text in other languages, but its performance there is likely far inferior to English;
- There are nine types of prompts, and the vast majority are generation tasks, so some task types may not be covered at all;
- The 40 outsourced labelers are concentrated in the United States and Southeast Asia, a fairly small and geographically narrow group. The value system of InstructGPT/ChatGPT is the combination of these 40 labelers' values, so it may still produce discrimination and bias issues that are more salient in other regions.
Additionally, the ChatGPT blog mentions that ChatGPT and InstructGPT have the same training method, with the only difference being in data collection, but no further details are provided regarding the specifics of data collection differences. Given that ChatGPT is only used in dialogue scenarios, I speculate that there are two differences in data collection for ChatGPT: 1. An increased proportion of dialogue tasks; 2. A shift from prompts to Q&A formats. However, this is merely speculation, and a more accurate description will await the publication of ChatGPT’s paper, source code, and more detailed materials.
2.2 Training Tasks
We just introduced that InstructGPT/ChatGPT involves three training steps. These three steps will involve three models: SFT, RM, and PPO, which we will detail below.
2.2.1 Supervised Fine-Tuning (SFT)
This training step is consistent with GPT-3, and the authors found that allowing the model to overfit slightly helps with the subsequent two training steps.
2.2.2 Reward Model (RM)
Since the data used to train the RM consists of labelers' rankings of generated results, the RM can be viewed as a regression model: its structure is the SFT model with the final unembedding layer removed, its input is a prompt together with a response, and its output is a scalar reward value. Specifically, for each prompt, InstructGPT/ChatGPT randomly samples K outputs (4≤K≤9) and shows labelers every pair of outputs, i.e., $C_K^2$ pairs per prompt; the labeler selects the better output in each pair. During training, InstructGPT/ChatGPT treats all $C_K^2$ response pairs of a single prompt as one batch. This per-prompt batching is less prone to overfitting than treating each comparison as an independent sample, because each prompt then enters the model only once.
The loss function for the reward model is expressed as equation (1). The goal of this loss function is to maximize the difference between responses that labelers prefer and those they do not.
$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} E_{(x,\, y_w,\, y_l)\sim D}\Big[\log\Big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\Big)\Big] \tag{1}$$
where $r_\theta(x, y)$ is the reward value for prompt $x$ and response $y$ under parameters $\theta$, $y_w$ is the response preferred by the labeler, $y_l$ is the response not preferred by the labeler, and $D$ is the entire comparison training dataset.
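As a concrete illustration, the pairwise loss in equation (1) can be written in a few lines of PyTorch; the tensor names and toy reward values below are illustrative, and the $1/C_K^2$ factor is absorbed into the mean over the batch of pairs.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(rewards_chosen, rewards_rejected):
    """Pairwise ranking loss as in equation (1).

    rewards_chosen  = r_theta(x, y_w) for the preferred responses,
    rewards_rejected = r_theta(x, y_l) for the rejected responses,
    both of shape (batch,).
    """
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()

# Toy usage with scalar rewards from a hypothetical reward model.
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.4, -0.1, 0.9])
print(reward_model_loss(chosen, rejected))
```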
2.2.3 Reinforcement Learning Model (PPO)
Reinforcement learning and pre-trained models are two of the hottest AI directions in recent years. Many researchers previously argued that reinforcement learning was not well-suited for application in pre-trained models due to the difficulty of establishing a reward mechanism based on model output. However, InstructGPT/ChatGPT counterintuitively achieved this by combining human annotations, which is the greatest innovation of this algorithm.
As shown in Table 2, the PPO training set is entirely sourced from the API. It uses the reward model obtained from step 2 to guide the continued training of the SFT model. Often, reinforcement learning is challenging to train, and InstructGPT/ChatGPT faced two problems during training:
- Problem 1: as the model updates, the data generated by the reinforcement learning policy diverges increasingly from the data used to train the reward model. The authors' solution was to add a KL penalty term $\beta \log\big(\pi_\phi^{RL}(y \mid x) / \pi^{SFT}(y \mid x)\big)$ to the objective, ensuring that the outputs of the PPO model do not diverge too far from those of the SFT model.
- Problem 2: training with the PPO objective alone can lead to a significant decline in performance on general NLP tasks. The authors' solution was to add a general language-modeling term $\gamma E_{x\sim D_{\text{pretrain}}}\big[\log \pi_\phi^{RL}(x)\big]$ to the training objective; this variant is referred to as PPO-ptx in the paper.
In summary, the training objective for PPO is expressed as equation (2):

$$\text{objective}(\phi) = E_{(x, y)\sim D_{\pi_\phi^{RL}}}\Big[r_\theta(x, y) - \beta \log\big(\pi_\phi^{RL}(y \mid x) / \pi^{SFT}(y \mid x)\big)\Big] + \gamma E_{x\sim D_{\text{pretrain}}}\Big[\log\big(\pi_\phi^{RL}(x)\big)\Big] \tag{2}$$
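The sketch below shows how the per-sample terms of equation (2) might be combined in code; the variable names, the coefficient values, and the merging of the two expectations into one mean are all simplifications for illustration, not the paper's implementation.

```python
import torch

def ppo_ptx_objective(reward, logp_rl, logp_sft, logp_pretrain,
                      beta=0.1, gamma=1.0):
    """Per-sample sketch of the PPO-ptx objective (to be maximized).

    reward        : r_theta(x, y) from the reward model
    logp_rl       : log pi_RL(y|x) under the policy being trained
    logp_sft      : log pi_SFT(y|x) under the frozen SFT model
    logp_pretrain : log pi_RL(x) on tokens from the pretraining distribution
    beta and gamma are illustrative coefficient values.
    """
    kl_penalty = beta * (logp_rl - logp_sft)     # beta * log(pi_RL / pi_SFT)
    rl_term = reward - kl_penalty                # reward-model score minus KL penalty
    ptx_term = gamma * logp_pretrain             # pretraining LM objective
    # In the paper the two expectations are over different distributions;
    # they are combined into a single mean here only for brevity.
    return (rl_term + ptx_term).mean()

# Toy usage with made-up values.
print(ppo_ptx_objective(
    reward=torch.tensor([0.7, 1.1]),
    logp_rl=torch.tensor([-12.0, -9.5]),
    logp_sft=torch.tensor([-12.3, -9.4]),
    logp_pretrain=torch.tensor([-80.0, -75.0]),
))
```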
3. Performance Analysis of InstructGPT/ChatGPT
It is undeniable that the performance of InstructGPT/ChatGPT is excellent, especially after the introduction of human annotations, significantly improving the model’s “value system” and the authenticity of its behavior patterns. Based solely on the technical solutions and training methods of InstructGPT/ChatGPT, we can analyze the potential improvements it can bring.
3.1 Advantages
- InstructGPT/ChatGPT performs more authentically than GPT-3: this is easily understood, as GPT-3 already possesses strong generalization and generation capabilities. With labelers introduced for prompt writing and result ranking, and fine-tuning on top of GPT-3, the reward model assigns higher rewards to more authentic outputs. The authors also compared performance with GPT-3 on the TruthfulQA dataset, finding that even the smaller 1.3-billion-parameter PPO-ptx model outperformed GPT-3.
- InstructGPT/ChatGPT shows some improvement in harmlessness over GPT-3: the principle is the same. However, the authors found that InstructGPT did not show significant improvements on datasets related to discrimination and bias. This is because GPT-3 itself is already a highly effective model, with a low probability of generating problematic samples that contain harmful, discriminatory, or biased content, and the limited data collected and annotated by just 40 labelers may not sufficiently optimize the model in these areas, resulting in minimal or undetectable improvements.
- InstructGPT/ChatGPT possesses strong coding capabilities: firstly, GPT-3 already has strong coding capabilities, and the API built on GPT-3 has accumulated a wealth of coding data. Additionally, some internal OpenAI employees participated in data collection. Given the substantial code-related data and human annotations, it is not surprising that InstructGPT/ChatGPT exhibits impressive coding capabilities.
3.2 Disadvantages
- InstructGPT/ChatGPT may reduce model performance on general NLP tasks: this issue was discussed in the PPO training phase. Although modifying the loss function alleviates it, the problem remains unresolved.
- InstructGPT/ChatGPT sometimes produces absurd outputs: although it uses human feedback, the amount of human feedback is limited. What influences model behavior most is still the supervised language-modeling objective, with human feedback acting mainly as a corrective. Therefore, because the corrective data is limited, or because the supervised objective is misleading (it optimizes only the model's output without modeling human intent), the model can still generate untrue content. It is like a student who, despite having a teacher, may not master every knowledge point.
- The model is highly sensitive to instructions: this can also be attributed to insufficient labeler-annotated data, since instructions are the model's only clue for generating output; if the quantity and variety of instructions in training are insufficient, the model exhibits this problem.
- The model may overinterpret simple concepts: this may stem from labelers tending to favor longer outputs when comparing generated results, leading to higher rewards for longer outputs.
- Harmful instructions may yield harmful responses: for example, InstructGPT/ChatGPT may provide an action plan in response to a user prompt such as "AI's plan to destroy humanity" (see Figure 5). This occurs because InstructGPT/ChatGPT assumes the instructions written by labelers are reasonable and value-aligned and does not make finer-grained judgments about user-provided prompts, so the model may respond to almost any input. Although the subsequent reward model may assign low rewards to such outputs, the model must balance its value system against faithfully following the instruction, and this trade-off sometimes produces outputs that conflict with those values.
Figure 5: The plan for humanity’s destruction written by ChatGPT.
3.3 Future Work
Having analyzed the technical solutions and issues of InstructGPT/ChatGPT, we can also identify potential optimization angles for InstructGPT/ChatGPT.
- Cost-effective human annotation: InstructGPT/ChatGPT employed a labeling team of 40, but judging by the model's performance, this team is insufficient. Finding ways for humans to provide more effective feedback, and integrating human performance with model performance, is crucial.
- The model's generalization and error-correction abilities regarding instructions: as instructions are the only clue for generating outputs, the model relies on them heavily. Improving the model's generalization across instructions and its ability to correct erroneous instructions is vital for enhancing the user experience; it would both expand the model's application scenarios and make it more "intelligent."
- Avoiding performance decline on general tasks: this may require designing a more reasonable way to use human feedback, or developing more advanced model architectures. While many of the issues discussed above could be resolved by having labelers provide more labeled data, this can lead to a more severe decline in performance on general NLP tasks, so solutions are needed that balance the quality of generated results against performance on general NLP tasks.
3.4 Hot Topic Responses from InstructGPT/ChatGPT
- Will the emergence of ChatGPT lead to job losses for low-level programmers? Judging from ChatGPT's principles and the generated examples circulating online, ChatGPT can generate code that often runs correctly. However, a programmer's job extends beyond merely writing code; finding solutions to problems is paramount. Therefore, ChatGPT will not replace programmers, especially senior ones. Instead, like many existing code-generation tools, it will serve as a valuable tool for programmers.
- Stack Overflow announces temporary rules: ban on ChatGPT. ChatGPT is fundamentally a text generation model; rather than generating code, it is better at generating text that can easily be mistaken for real. Code or solutions produced by a text generation model are not guaranteed to run or to solve the problem at hand, yet the realistic appearance of the generated text can confuse many users seeking solutions. Stack Overflow's ban on ChatGPT is therefore part of its quality control efforts.
- The chatbot ChatGPT writes a "plan to destroy humanity" under prompting; what issues should we be aware of in AI development? ChatGPT's "plan to destroy humanity" is simply content fitted from its vast training data under an induced prompt. Although the output may appear realistic and fluent, it merely shows that ChatGPT has strong generation capabilities, not that it possesses any intent to destroy humanity. It is simply a text generation model, not a decision-making model.
4. Conclusion
Just like many algorithms at their inception, ChatGPT has garnered wide attention in the industry and sparked human contemplation about AI due to its effectiveness, authenticity, and harmlessness. However, upon examining its algorithmic principles, we find that it is not as terrifying as promoted in the industry. Instead, we can learn many valuable lessons from its technical solutions. The most significant contribution of InstructGPT/ChatGPT to the AI field is the clever integration of reinforcement learning and pre-trained models. Furthermore, through human feedback, it enhances the model’s usefulness, authenticity, and harmlessness. ChatGPT also raises the cost of large models; previously, the competition was based on data volume and model scale, but now it even includes the expenditure on hired outsourcing, making it more daunting for individual workers.