Source: JD Cloud Dolphin Data Science Laboratory
This article is about 7000 words, recommended reading time is 15 minutes.
To understand ChatGPT, we must first understand InstructGPT.
Introduction
The GPT series is a family of pre-trained models from OpenAI, where GPT stands for Generative Pre-trained Transformer. As the name suggests, the goal of GPT is to obtain a general-purpose text model by combining a Transformer-based architecture with pre-training techniques. The papers published so far cover the text pre-trained GPT-1, GPT-2, and GPT-3, as well as the image pre-trained model iGPT. The yet-to-be-released GPT-4 is rumored to be a multimodal model. The recently popular ChatGPT and InstructGPT, which was released earlier this year, are sister models, sometimes referred to as GPT-3.5. ChatGPT and InstructGPT share the same model structure and training method, both using Instruction Learning and Reinforcement Learning from Human Feedback (RLHF) to guide training; the only difference lies in how the data was collected. Therefore, to understand ChatGPT, we must first understand InstructGPT.
1. Background Knowledge
Before introducing ChatGPT/InstructGPT, let’s first discuss the foundational algorithms they rely on.
1.1 The GPT Series
The three generations of models based on text pre-training, GPT-1, GPT-2, and GPT-3, utilize models centered around the Transformer structure (Figure 1). The differences lie in the number of layers, word vector lengths, and other hyperparameters, as detailed in Table 1.

Table 1: Comparison of the three GPT models

| Model | Release Date | Layers | Heads | Word Vector Length | Parameter Count | Pre-training Data Volume |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-1 | June 2018 | 12 | 12 | 768 | 117 million | About 5GB |
| GPT-2 | February 2019 | 48 | – | 1600 | 1.5 billion | 40GB |
| GPT-3 | May 2020 | 96 | 96 | 12288 | 175 billion | 45TB |
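As a rough sanity check on Table 1, the parameter counts can be approximately reproduced from the layer count and hidden size using the standard ~12·L·d² estimate for a decoder-only Transformer. The snippet below is only a back-of-the-envelope sketch that ignores embedding matrices and biases, not OpenAI's actual accounting.

```python
# Back-of-the-envelope parameter estimate for a decoder-only Transformer:
# each layer holds roughly 4*d^2 attention weights + 8*d^2 feed-forward
# weights = 12*d^2; embeddings and biases are ignored, so numbers are rough.
def approx_params(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2

for name, layers, d_model in [("GPT-1", 12, 768), ("GPT-2", 48, 1600), ("GPT-3", 96, 12288)]:
    print(f"{name}: ~{approx_params(layers, d_model) / 1e9:.2f}B parameters")
# GPT-1: ~0.08B (token embeddings add roughly 30M more, close to the 117M in Table 1)
# GPT-2: ~1.47B    GPT-3: ~173.95B
```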
GPT-1 was released a few months earlier than BERT. Both utilize the Transformer as their core structure, but GPT-1 was constructed through a left-to-right generative pre-training task to obtain a general pre-trained model, which, like BERT, can be fine-tuned for downstream tasks. GPT-1 achieved state-of-the-art results on nine NLP tasks, but its model size and data volume were relatively small, prompting the development of GPT-2.
Compared to GPT-1, GPT-2 did not significantly alter the model structure but used a model with more parameters and more training data (Table 1). The most important idea of GPT-2 was the proposition that all supervised learning is a subset of unsupervised language models, which also laid the groundwork for Prompt Learning. GPT-2 caused quite a stir upon its release, as the news it generated was enough to deceive most humans, achieving a level of realism that led to it being called the “most dangerous weapon in AI.” Many major websites even ordered a ban on the use of news generated by GPT-2.
When GPT-3 was proposed, aside from its performance far exceeding that of GPT-2, the more significant discussion revolved around its 175 billion parameters. In addition to completing common NLP tasks, researchers unexpectedly found that GPT-3 performed well in writing code in SQL, JavaScript, and other languages, as well as performing simple mathematical operations. The training of GPT-3 utilized in-context learning, a form of meta-learning, which aims to find a suitable initialization range through a small amount of data, enabling the model to fit quickly on limited datasets and achieve good results.
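To make in-context learning concrete, here is an illustrative few-shot prompt of the kind GPT-3 is evaluated with: the task examples are placed directly in the input text and the model is expected to continue the pattern with no gradient updates. The prompt wording and the commented-out `complete(...)` call are purely illustrative, not a real OpenAI API method.

```python
# Illustrative few-shot (in-context) prompt: the "learning" happens entirely
# in the context window; no model parameters are updated.
few_shot_prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

# Hypothetical completion call (placeholder, not a real client method):
# completion = complete(model="gpt-3", prompt=few_shot_prompt)
# Expected continuation: " fromage"
```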
From the analysis above, we can see that GPT has two performance goals:
- Improve the model's performance on common NLP tasks;
- Enhance the model's generalization ability on atypical NLP tasks (such as code writing and mathematical operations).
Additionally, since the inception of pre-trained models, a heavily criticized issue has been the bias inherent in these models. Pre-trained models are trained on massive datasets with large parameter models, and compared to expert systems fully controlled by human rules, pre-trained models resemble a black box. No one can guarantee that pre-trained models will not generate dangerous content, such as racial or gender discrimination, because their training data, often in the tens of GBs or even TBs, almost certainly contains similar samples. This is the motivation behind the development of InstructGPT and ChatGPT, whose optimization objectives are summarized in the paper with the acronym 3H:
- Helpful;
- Honest;
- Harmless.
The GPT series of models from OpenAI has not been open-sourced, but they provide a trial website for the models, which interested individuals can use.
1.2 Instruction Learning and Prompt Learning
Instruction learning is a concept proposed by Quoc V. Le's team at Google in the 2021 paper "Finetuned Language Models Are Zero-Shot Learners". Both instruction learning and prompt learning aim to tap into the knowledge already present in language models. The difference is that a prompt stimulates the language model's completion ability, such as generating the second half of a sentence from the first half or filling in a blank, whereas an instruction stimulates the model's understanding ability by stating explicitly what action the model should take. We can see the difference between the two learning methods in the following examples:
- Prompt learning: I bought this necklace for my girlfriend, she really likes it, this necklace is too ____.
- Instruction learning: Determine the sentiment of this sentence: I bought this necklace for my girlfriend, she really likes it. Options: A=Good; B=Average; C=Bad.
The advantage of instruction learning is that after multi-task fine-tuning, it can also perform zero-shot on other tasks, while prompt learning is task-specific and has less generalization ability than instruction learning. We can understand the differences among fine-tuning, prompt learning, and instruction learning through Figure 2.
1.3 Reinforcement Learning from Human Feedback
A trained language model is not very controllable; it can be viewed as a fit to the distribution of its training set. Reflected back into the generative model, this means the distribution of the training data is the most critical factor affecting the quality of the generated content. Sometimes we want the model to be shaped not only by the training data but also by human control, so that the generated content is useful, truthful, and harmless. The paper repeatedly mentions the alignment problem, which we can understand as aligning the model's output with the outputs humans prefer. Human preference covers not only the fluency and grammatical correctness of the generated content but also its usefulness, truthfulness, and harmlessness.
We know that reinforcement learning guides model training through a reward mechanism, and the reward can be seen as playing the role that the loss function plays in traditional model training. The computation of rewards is more flexible and diverse than that of a loss function (for example, AlphaGO's reward is the outcome of the game), but the trade-off is that rewards are non-differentiable and therefore cannot be used directly for backpropagation. The idea of reinforcement learning is to approximate the loss function through extensive sampling of rewards, thus enabling model training. Similarly, human feedback is non-differentiable, so it too can be treated as a reinforcement learning reward, which is how Reinforcement Learning from Human Feedback (RLHF) emerged.
RLHF can be traced back to the 2017 paper "Deep Reinforcement Learning from Human Preferences" by OpenAI and DeepMind, which improved the performance of reinforcement learning on simulated robotics and Atari games by using human annotations as feedback.
Figure 3: Basic Principle of Reinforcement Learning from Human Feedback
InstructGPT/ChatGPT also uses a classic reinforcement learning algorithm: Proximal Policy Optimization (PPO), proposed by OpenAI. PPO is a Policy Gradient method. Plain Policy Gradient algorithms are very sensitive to the step size, yet an appropriate step size is hard to choose; if the new and old policies differ too much during training, learning suffers. PPO introduces a new objective function that allows small-batch updates over multiple training steps, solving the step-size problem of Policy Gradient algorithms. TRPO was designed to address the same issue, but compared to TRPO, PPO is easier to implement and solve.
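As a rough illustration of how PPO constrains the update size, below is a minimal sketch of the standard clipped surrogate loss in PyTorch; it shows the general technique only and is not the exact objective or code used for InstructGPT.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate loss (returned as a value to minimize)."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping the ratio keeps each update close to the old policy,
    # which is PPO's answer to the step-size problem described above.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```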
2. Understanding the Principles of InstructGPT/ChatGPT
With the background above, understanding InstructGPT and ChatGPT becomes much simpler. In short, InstructGPT/ChatGPT adopts the network structure of GPT-3 and, using instruction learning, constructs training samples to train a reward model (RM) that predicts the quality of generated content; the RM's score is then used to guide the training of the reinforcement learning model. The training process of InstructGPT/ChatGPT is illustrated in Figure 4.
Figure 4: Computational Process of InstructGPT: (1) Supervised Fine-Tuning (SFT); (2) Reward Model (RM) Training; (3) Reinforcement Learning based on PPO using the Reward Model.
From Figure 4, we can see that the training of InstructGPT/ChatGPT is divided into three steps, where the second and third steps can be iterated so that the reward model and the RL-tuned SFT model are optimized alternately.
- Fine-tune GPT-3 with supervised learning on the collected SFT dataset;
- Collect human-ranked comparison data and train the reward model (RM);
- Use the RM as the optimization target for reinforcement learning and fine-tune the SFT model with the PPO algorithm.
Based on Figure 4, we will introduce the data collection and model training aspects of InstructGPT/ChatGPT.
2.1 Data Collection
As shown in Figure 4, the training of InstructGPT/ChatGPT is divided into three steps, each requiring slightly different data. Below, we will introduce them respectively.
2.1.1 SFT Dataset
The SFT dataset is used to train the supervised model in the first step, fine-tuning GPT-3 on newly collected data in the same way GPT-3 was originally trained. Since GPT-3 is a generative model driven by prompts, the SFT dataset consists of prompt-response pairs (a hypothetical record format is sketched after the list below). Part of the SFT data comes from users of OpenAI's Playground, and the other part comes from 40 labelers hired and trained by OpenAI for this task. The labelers' job is to write instructions that satisfy the following three requirements:
- Simple tasks: labelers write arbitrary simple tasks while ensuring task diversity;
- Few-shot tasks: labelers provide an instruction along with multiple query-response pairs for that instruction;
- User-relevant: labelers obtain use cases from the interface and write instructions based on those use cases.
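For concreteness, an SFT training example can be thought of as a simple prompt-demonstration record; the layout below is a hypothetical illustration of such a pair (the field names and the example response are mine, not OpenAI's internal format).

```python
import json

# Hypothetical SFT record: an instruction-style prompt (written by a labeler or
# taken from Playground usage) plus a human-written demonstration response.
sft_example = {
    "prompt": "Explain the moon landing to a 6 year old in a few sentences.",
    "response": (
        "Some astronauts flew to the moon in a big rocket, walked around on it, "
        "and then came back to Earth to tell everyone what they saw."
    ),
}
print(json.dumps(sft_example, indent=2))
```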
2.1.2 RM Dataset
The RM dataset is used to train the reward model in the second step, setting a reward target for the training of InstructGPT/ChatGPT. This reward target does not need to be differentiable but must comprehensively and authentically align with the content we need the model to generate. Naturally, we can provide this reward through human annotations, giving lower scores to generated content that involves bias to encourage the model not to generate content that humans dislike. InstructGPT/ChatGPT’s approach is to first generate a batch of candidate texts and then have labelers rank these generated contents based on their quality.
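In practice, a labeler's ranking of K candidate responses can be expanded into pairwise comparisons for reward-model training. The sketch below shows this expansion under my assumption that the candidate list is already sorted from best to worst; the function name and data layout are placeholders, not details from the paper.

```python
from itertools import combinations

def ranking_to_pairs(prompt: str, ranked_responses: list[str]):
    """Expand a best-first ranking into (prompt, preferred, rejected) pairs."""
    # combinations() preserves input order, so `better` is always ranked
    # above `worse` when the list is sorted best-first.
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_responses, 2)]

pairs = ranking_to_pairs("Write a haiku about autumn.",
                         ["response_1", "response_2", "response_3", "response_4"])
print(len(pairs))  # C(4, 2) = 6 comparisons from a single prompt
```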
2.1.3 PPO Dataset
The PPO dataset for InstructGPT is not annotated; it comes entirely from users of the GPT-3 API. It covers the various task types submitted by different users, with the largest proportions being generation (45.6%), QA (12.4%), brainstorming (11.2%), and dialogue (8.4%).
2.1.4 Data Analysis
Since InstructGPT/ChatGPT is fine-tuned based on GPT-3 and involves human annotations, the total amount of data is not large. Table 2 shows the sources and data volumes of the three datasets.
Table 2: Data Distribution of InstructGPT
Appendix A of the paper discusses the data distribution in more detail. Here I list a few factors that may affect the model's performance:
- Over 96% of the data is in English; the other 20 languages, including Chinese, French, and Spanish, account for less than 4% combined. InstructGPT/ChatGPT can therefore generate text in other languages, but the quality is likely far inferior to English;
- There are nine types of prompts, and the vast majority are generation tasks, so some task types may not be covered;
- The 40 outsourced labelers are concentrated in the U.S. and Southeast Asia. Being few in number and relatively homogeneous, they may introduce bias or overlook discrimination issues that are more of a concern in other regions, because InstructGPT/ChatGPT's aim of training a pre-trained model with correct values ultimately rests on the combined values of these 40 labelers.
Additionally, the ChatGPT blog mentions that ChatGPT and InstructGPT have the same training methods, with the only difference being in data collection, but no further details are available on the differences in data collection. Considering that ChatGPT is used solely in dialogue, I speculate that ChatGPT has two differences in data collection: 1. An increase in the proportion of dialogue tasks; 2. Transformation of prompts into Q&A format. Of course, this is merely speculation, and more accurate descriptions will await the release of ChatGPT’s paper, source code, and other detailed materials.
2.2 Training Tasks
We just introduced the three-step training method of InstructGPT/ChatGPT. These three steps involve three models: SFT, RM, and PPO, which we will detail below.
2.2.1 Supervised Fine-Tuning (SFT)
This training step is consistent with GPT-3, and the authors found that allowing the model to overfit slightly helps the training of the subsequent two steps.
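Mechanically, SFT is ordinary causal-language-model fine-tuning: cross-entropy on the demonstration tokens conditioned on the prompt. The sketch below uses Hugging Face conventions with GPT-2 as a stand-in for the closed GPT-3 weights; it is a minimal illustration of the idea, not the training code from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in for GPT-3, whose weights are not publicly available.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain the moon landing to a 6 year old.\n"
response = "Some astronauts flew to the moon in a rocket and walked on it."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Mask the prompt tokens with -100 so that only the demonstration (response)
# tokens contribute to the cross-entropy loss.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()  # an optimizer step would follow in a real training loop
```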
2.2.2 Reward Model (RM)
Since the data for training the RM consists of labelers' rankings of generated results, the RM can be viewed as a regression model. Structurally, the RM is the SFT-trained model with its final unembedding layer removed; its input is a prompt and a response, and its output is a scalar reward value. Specifically, for each prompt, InstructGPT/ChatGPT randomly generates K outputs (4 ≤ K ≤ 9) and shows the outputs to each labeler in pairs, i.e. $C_K^2$ pairs per prompt, and the labeler selects the better output in each pair. During training, InstructGPT/ChatGPT treats the $C_K^2$ response pairs of each prompt as a single batch. This per-prompt batching is less prone to overfitting than the traditional per-sample batching, because each prompt enters the model only once.
The loss function for the reward model is expressed as Equation (1). The goal of this loss function is to maximize the difference between the responses preferred by labelers and those they do not prefer.
$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x,\, y_w,\, y_l)\sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big] \tag{1}$$

where $r_\theta(x, y)$ is the reward of prompt $x$ and response $y$ under the reward model with parameters $\theta$, $y_w$ is the response preferred by the labeler, $y_l$ is the response not preferred, and $D$ is the entire training dataset.
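A minimal PyTorch-style sketch of Equation (1) is given below, assuming a `reward_model(prompt, response)` callable that returns the scalar $r_\theta(x, y)$; the interface is a placeholder for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(reward_model, prompt, preferred, rejected):
    """One comparison from Equation (1): -log(sigmoid(r(x, y_w) - r(x, y_l)))."""
    r_w = reward_model(prompt, preferred)   # r_theta(x, y_w)
    r_l = reward_model(prompt, rejected)    # r_theta(x, y_l)
    return -F.logsigmoid(r_w - r_l)         # numerically stabler than log(sigmoid(.))

def rm_prompt_batch_loss(reward_model, prompt, comparison_pairs):
    """Average over the C(K, 2) pairs of one prompt, i.e. one 'batch' as
    described above; the mean supplies the 1/C(K, 2) factor."""
    losses = [rm_pairwise_loss(reward_model, prompt, w, l)
              for (w, l) in comparison_pairs]
    return torch.stack(losses).mean()
```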
2.2.3 Reinforcement Learning Model (PPO)
Reinforcement learning and pre-trained models have been two of the hottest AI directions in recent years. Many researchers previously argued that reinforcement learning is not very suitable for application in pre-trained models because it is challenging to establish a reward mechanism based on the model’s output. InstructGPT/ChatGPT counterintuitively achieved this by integrating human annotations, which is the most significant innovation of this algorithm.
As shown in Table 2, the training set for PPO comes entirely from the API. It uses the reward model obtained from the second step to guide the continued training of the SFT model. Often, reinforcement learning is very challenging to train, and InstructGPT/ChatGPT faced two problems during training:
- Problem 1: As the model updates, the data generated by the reinforcement learning policy diverges more and more from the data used to train the reward model. The authors' solution was to add a KL penalty term $\beta \log\big(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\big)$ to the objective, ensuring that the PPO model's output does not deviate too far from the SFT output.
- Problem 2: Training with the PPO objective alone leads to a significant drop in performance on common NLP tasks. The authors' solution was to add a general language-modeling term $\gamma\, \mathbb{E}_{x\sim D_{\text{pretrain}}}\big[\log\big(\pi_\phi^{\mathrm{RL}}(x)\big)\big]$ to the training objective; this variant is referred to as PPO-ptx in the paper.
In summary, the training objective of PPO-ptx is given by Equation (2):

$$\text{objective}(\phi) = \mathbb{E}_{(x, y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\Big[r_\theta(x, y) - \beta \log\big(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\big)\Big] + \gamma\, \mathbb{E}_{x\sim D_{\text{pretrain}}}\Big[\log\big(\pi_\phi^{\mathrm{RL}}(x)\big)\Big] \tag{2}$$
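Schematically, Equation (2) can be rendered in code as below. The helper functions `reward_model`, `logprob_rl`, `logprob_sft`, and `pretrain_logprob`, as well as the coefficient values, are placeholders of mine; the full PPO machinery (advantage estimation, clipping, value function) is omitted.

```python
def ppo_ptx_objective(x, y, x_pretrain, beta=0.1, gamma=1.0):
    """Per-sample sketch of Equation (2); beta and gamma are placeholder values."""
    # Reward from the trained RM for the policy's sampled response y.
    r = reward_model(x, y)
    # KL-style penalty keeping the RL policy close to the SFT policy:
    # beta * log(pi_RL(y|x) / pi_SFT(y|x)).
    kl_penalty = beta * (logprob_rl(y, x) - logprob_sft(y, x))
    # Pretraining (ptx) term that preserves plain language-modeling performance.
    ptx_term = gamma * pretrain_logprob(x_pretrain)
    return r - kl_penalty + ptx_term  # quantity to be *maximized*
```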
3. Performance Analysis of InstructGPT/ChatGPT
It is undeniable that InstructGPT/ChatGPT performs exceptionally well, especially after the introduction of human annotations, which significantly enhances the model’s correctness of values and the authenticity of human behavior patterns. Based solely on the technical solutions and training methods of InstructGPT/ChatGPT, we can analyze the improvements it can bring.
3.1 Advantages
- InstructGPT/ChatGPT's output is more authentic than GPT-3's: This is easy to understand. GPT-3 already possesses strong generalization and generative capabilities; on top of that, different labelers wrote the prompts and ranked the results, and the model was fine-tuned on GPT-3 so that more authentic data receives higher rewards when training the reward model. The authors also compared performance with GPT-3 on the TruthfulQA dataset, and the experiments showed that even the smaller 1.3-billion-parameter PPO-ptx model outperformed GPT-3.
- InstructGPT/ChatGPT is slightly more harmless than GPT-3: The reasoning is the same as above. However, the authors found that InstructGPT did not show significant improvement on datasets measuring discrimination and bias. This is because GPT-3 is already a very capable model and the probability of it generating harmful, discriminatory, or biased content is already low; the data collected and annotated by only 40 labelers may not be enough to further optimize the model in these respects, so the improvement is minimal or unnoticeable.
- InstructGPT/ChatGPT has strong coding abilities: First, GPT-3 already codes well, and the API built on GPT-3 has accumulated a large amount of coding data. Some OpenAI employees also took part in the data collection. It is therefore not surprising that InstructGPT/ChatGPT, trained with substantial coding-related data and human annotations, exhibits strong coding capabilities.
3.2 Disadvantages
- InstructGPT/ChatGPT may reduce performance on general NLP tasks: We discussed this issue when describing PPO training; modifying the loss function can alleviate it, but the problem has not been fully resolved.
- InstructGPT/ChatGPT sometimes produces absurd outputs: Although it uses human feedback, the limited human resources still cap model performance. Human feedback mainly plays a corrective role, and because the correction data is limited, or because the supervised task is misleading (focusing only on the model's output rather than on what humans actually want), the model may still generate untruthful content. Like a student, even with a teacher's guidance, it cannot be guaranteed to learn every knowledge point.
- The model is highly sensitive to instructions: This can also be attributed to insufficient labeled data, since instructions are the only clue the model has for producing output; if the quantity and variety of instructions are not adequately covered in training, the model may exhibit this problem.
- The model may over-interpret simple concepts: This may be because labelers tend to give longer outputs higher rewards when comparing generated content.
- Harmful instructions may lead to harmful responses: For example, InstructGPT/ChatGPT may provide an action plan in response to a user's request for an "AI plan to destroy humanity" (Figure 5). This is because InstructGPT/ChatGPT assumes that the instructions written by labelers are reasonable and reflect correct values, and it does not judge user-provided instructions in more detail, so the model will respond to almost any input. Although the reward model may later assign lower rewards to such outputs, during text generation the model must weigh both its values and how well the generated content matches the instruction, which can sometimes lead to outputs with problematic values.
Figure 5: The plan for destroying humanity generated by ChatGPT.
3.3 Future Work
Having analyzed the technical solutions and issues of InstructGPT/ChatGPT, we can also identify potential optimization angles for InstructGPT/ChatGPT.
- Reduce the cost and improve the efficiency of human annotation: InstructGPT/ChatGPT employs a labeling team of 40 people, but judging by the model's performance this team is not sufficient. Finding ways for humans to provide more effective feedback, and cleverly combining human and model capabilities, is crucial.
- Improve the model's ability to generalize over instructions and to correct erroneous ones: Since instructions are the model's only clue for producing output, the model depends on them heavily. Improving the model's ability to generalize from instructions and to correct errors in faulty instructions is a vital way to improve the user experience. It would not only broaden the model's application scenarios but also make it "smarter".
- Avoid performance decline on general tasks: This may require a more reasonable design for human feedback or a more advanced model structure. Many of the problems discussed above could be mitigated with more labeled data, but that would worsen the decline on general NLP tasks, so solutions are needed that balance the 3H of generated results against performance on general NLP tasks.
3.4 Addressing Hot Topics Regarding InstructGPT/ChatGPT
- Will the emergence of ChatGPT cost junior programmers their jobs? Judging from ChatGPT's principles and the generated content circulating online, ChatGPT can produce code that often runs correctly. However, a programmer's job is not just writing code; more importantly, it is finding solutions to problems. ChatGPT will therefore not replace programmers, especially senior programmers. On the contrary, like many existing code-generation tools, it will become a very useful aid for programmers when writing code.
- Stack Overflow announces a temporary policy: ChatGPT is banned. ChatGPT is fundamentally a text generation model; compared with generating code, it is better at generating text that merely looks plausible. The code or solutions produced by a text generation model are not guaranteed to run or to solve the problem, yet their realistic-looking text can mislead many people searching for answers. To maintain the quality of the forum, Stack Overflow's ban on ChatGPT is a necessary cleanup.
- The chatbot ChatGPT wrote a "plan to destroy humanity" when induced to; what should AI development pay attention to? ChatGPT's "plan to destroy humanity" is content produced by fitting massive amounts of data under an unforeseen instruction. Although the content appears realistic and is fluently expressed, it only demonstrates ChatGPT's strong generative ability and does not imply that ChatGPT has any intention of destroying humanity; it is merely a text generation model, not a decision-making model.
4. Conclusion
Just like many algorithms when they were first born, ChatGPT has attracted widespread attention in the industry and provoked human reflection on AI due to its usefulness, authenticity, and harmlessness. However, after examining its algorithmic principles, we find that it is not as terrifying as advertised in the industry. Instead, we can learn many valuable lessons from its technical solutions. The most significant contribution of InstructGPT/ChatGPT in the AI field is the clever combination of reinforcement learning and pre-trained models. Moreover, by incorporating human feedback, it has enhanced the model’s usefulness, authenticity, and harmlessness. ChatGPT has also further increased the costs of large models; previously, the competition was merely about data volume and model scale, but now it even involves hiring outsourced workers, making it even more daunting for individual workers.
This article is adapted from the WeChat public account “Turing Artificial Intelligence”.