Understanding the Working Principles Behind ChatGPT


Reprinted from | Machine Heart · Author | Marco Ramponi · Editors | Wang Qiang, Dan Jiang · Selected from | Assembly AI

Since the release of ChatGPT, it has attracted countless people to explore its workings. But how does ChatGPT actually work? Although the internal implementation details have not been disclosed, we can glimpse its basic principles from recent research.

ChatGPT is the latest language model released by OpenAI, with significant improvements over its predecessor GPT-3. Like many large language models, ChatGPT can generate text in different styles and for different purposes, and it performs better in accuracy, narrative detail, and contextual coherence. It represents OpenAI’s latest generation of large language models, with a strong focus on interactivity in its design.

OpenAI uses a combination of supervised learning and reinforcement learning to fine-tune ChatGPT, where the reinforcement learning component makes ChatGPT unique. OpenAI employs a training method called “Reinforcement Learning from Human Feedback” (RLHF), which uses human feedback during training to minimize unhelpful, distorted, or biased outputs.

This article analyzes the limitations of GPT-3 and how they arise from its training process, explains the principles of RLHF and how ChatGPT uses it to overcome GPT-3’s issues, and finally examines the limitations of this approach.

1

Capabilities and Consistency in Large Language Models


“Consistency vs Capability” can be viewed as a more abstract analogy of “Accuracy vs Precision”.

In machine learning, the capability of a model refers to its ability to perform a specific task or set of tasks. The capability of a model is typically assessed by how well it can optimize its objective function. For example, a model used to predict stock market prices may have an objective function that measures the accuracy of the model’s predictions. If the model can accurately predict the changes in stock prices over time, it is considered to have high capability.

Consistency focuses on what we actually want the model to do, rather than what it was trained to do. It asks whether the objective function matches our intentions, i.e., how closely the model’s objective and behavior align with human expectations. Suppose we want to train a bird classifier to label birds as “sparrows” or “robins,” using log loss as the training objective while the ultimate goal is high classification accuracy. The model may achieve a low log loss, indicating strong capability, yet poor accuracy on the test set. This is an example of inconsistency: the model can optimize the training objective without being aligned with the final goal.
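To make this concrete, here is a minimal NumPy sketch of the bird-classifier example with made-up predictions; it shows how a model can score better on the training objective (log loss) while doing worse on the goal we actually care about (accuracy):

```python
import numpy as np

def log_loss(y_true, p):
    """Binary cross-entropy, the training objective of the classifier."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 1, 1, 1])                  # ground truth: every bird is a "sparrow"
p_a = np.array([0.95, 0.95, 0.95, 0.02])    # model A: confident, with one confidently wrong call
p_b = np.array([0.45, 0.45, 0.45, 0.45])    # model B: always just on the wrong side of 0.5

for name, p in [("A", p_a), ("B", p_b)]:
    accuracy = np.mean((p > 0.5) == y)
    print(f"model {name}: log loss = {log_loss(y, p):.2f}, accuracy = {accuracy:.2f}")
```

Model B achieves the lower log loss of the two but classifies every example incorrectly, so optimizing the training objective alone does not guarantee the behavior we actually want.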

The original GPT-3 is not a consistent model. Large language models like GPT-3 are trained on vast amounts of text data from the internet and can generate human-like text, but they may not always produce outputs that meet human expectations. Their training objective is, in effect, a probability distribution over word sequences, used to predict the next word in a sequence.

However, in practical applications these models are meant to perform some form of valuable cognitive work, and there is a clear gap between how they are trained and how they are expected to be used. Mathematically, modeling the statistical distribution of word sequences may be an efficient way to model language, but humans generate language by choosing the text that best fits a given situation, drawing on background knowledge and common sense to do so. This mismatch becomes a problem when language models are used in applications that demand a high degree of trust or reliability, such as conversational systems or intelligent personal assistants.

Despite the immense power of these data-driven large models in recent years, they often fail to realize their potential when used in practical scenarios to help people live easier lives. The consistency issues in large language models typically manifest as:

  • Providing ineffective help: not following explicit user instructions.

  • Generating fabricated content: the model invents non-existent or incorrect facts.

  • Lack of interpretability: it is difficult for people to understand how the model arrived at a specific decision or prediction.

  • Harmful content bias: a language model trained on biased or harmful data may exhibit this in its outputs, even if it is not explicitly instructed to do so.

But specifically, where do consistency issues arise? Is the way language models are trained inherently prone to inconsistency?

2

How Language Model Training Strategies Lead to Inconsistency

Next-token prediction and masked language modeling are core techniques used for training language models. In the first method, the model is given a sequence of words as input and is asked to predict the next word in the sequence. If the model is provided with the input sentence:

“The cat sat on the”

It might predict the next word as “mat,” “chair,” or “floor,” as these words are highly probable given the preceding context; the language model can actually evaluate the likelihood of each possible word given the prior sequence.
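As a concrete illustration, the sketch below uses the Hugging Face transformers library with the small public GPT-2 model as a stand-in for a large model like GPT-3, and prints the model’s most probable next tokens for this prefix:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 is used here as a small, public stand-in for a much larger model like GPT-3.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # scores over the vocabulary for the next token

probs = logits.softmax(dim=-1)
top = probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(int(idx))!r}: {float(p):.3f}")
```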

The masked language modeling method is a variant of next-token prediction, where some words in the input sentence are replaced with a special token, such as [MASK]. The model is then asked to predict the correct word that should be inserted at the masked position. If given the sentence:

“The [MASK] sat on the”

It might predict that the word to fill the MASK position is “cat” or “dog.”
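The same mechanism can be tried with a small public masked language model; the sketch below uses the transformers fill-mask pipeline with bert-base-uncased, chosen here purely for illustration:

```python
from transformers import pipeline

# bert-base-uncased is a small public masked language model used only to illustrate the idea.
fill = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline returns the most likely fillers for the [MASK] position with their probabilities.
for pred in fill("The [MASK] sat on the"):
    print(f"{pred['token_str']}: {pred['score']:.3f}")
```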

One of the advantages of these objective functions is that they allow the model to learn the statistical structure of the language, such as common word sequences and usage patterns. This often helps the model generate more natural and fluent text and is an important step in the pre-training phase of each language model.

However, these objective functions can also lead to issues primarily because the model cannot distinguish between important and unimportant errors. A very simple example is if the model is given the input sentence:

“The Roman Empire [MASK] with the reign of Augustus.”

It might predict that the MASK position should be filled with “began” or “ended,” as both words have high probabilities, even though the two choices produce statements with very different historical meanings; the training objective gives the model no sense that one of these mistakes matters more than the other.

Generally, these training strategies may lead to inconsistencies in language models on some more complex tasks, as a model trained solely to predict the next word in a text sequence may not necessarily learn some higher-level representations of its meaning. Thus, the model may struggle to generalize to tasks that require a deeper understanding of language.

Researchers are exploring various methods to address the consistency issues in large language models. ChatGPT is based on the original GPT-3 model, but to address its inconsistency issues it was further trained with human feedback guiding the learning process. The specific technique used is the aforementioned RLHF, and ChatGPT is the first model to put this technique to use in a practical, deployed setting.

So how does ChatGPT utilize human feedback to solve consistency issues?

3

Reinforcement Learning from Human Feedback

The overall method consists of three different steps:

  • Supervised fine-tuning: The pre-trained language model is fine-tuned on a small amount of labeled data to learn a supervised policy that generates outputs from a selected list of prompts (the SFT model);

  • Simulating human preferences: Annotators rank a relatively large number of SFT model outputs, creating a new dataset of comparison data. This dataset is used to train a new model, referred to as the Reward Model (RM);

  • Proximal Policy Optimization (PPO): The RM is used to further fine-tune and improve the SFT model; the output of this step is the policy model (the PPO model).

Step 1 is done only once, while steps 2 and 3 can be continuously repeated: collecting more comparative data on the current best policy model to train a new RM model, then training a new policy. Next, the details of each step will be elaborated.
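Before diving into the details, here is a highly schematic sketch of how the three stages chain together; every function below is a hypothetical placeholder for an entire training stage, not OpenAI’s actual pipeline:

```python
def supervised_fine_tune(pretrained_lm, demonstrations):
    """Step 1: fine-tune the pre-trained model on annotator-written demonstrations (run once)."""
    return pretrained_lm  # placeholder: would return the SFT model

def train_reward_model(policy, comparisons):
    """Step 2: fit a reward model on annotator rankings of the current policy's outputs."""
    return "reward_model"  # placeholder

def ppo_fine_tune(policy, reward_model):
    """Step 3: improve the policy against the reward model with PPO."""
    return policy  # placeholder: would return the updated policy model

policy = supervised_fine_tune("pretrained_lm", "demonstration_data")
for _ in range(3):  # steps 2 and 3 can be repeated with fresh comparison data
    reward_model = train_reward_model(policy, "comparison_data")
    policy = ppo_fine_tune(policy, reward_model)
```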

Step 1: Supervised Fine-Tuning Model

The first step is to collect data to train a supervised policy model.

  • Data collection: A list of prompts is selected, and annotators write down the expected outputs as required. For ChatGPT, two different sources of prompts were used: some were directly prepared by annotators or researchers, while others were obtained from OpenAI’s API requests (i.e., from GPT-3 users). Although the whole process is slow and expensive, the final result is a relatively small, high-quality dataset (about 12-15k data points) that can be used to fine-tune the pre-trained language model.

  • Model selection: The developers of ChatGPT chose a pre-trained model from the GPT-3.5 series rather than fine-tuning the original GPT-3 model. The baseline used is the latest text-davinci-003, a GPT-3 model fine-tuned mainly on programming code.

To create a general-purpose chatbot like ChatGPT, developers fine-tuned on a “code model” rather than a pure text model.
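As a rough illustration of what one supervised fine-tuning step looks like, the sketch below uses the Hugging Face transformers library with the small public GPT-2 model standing in for the non-public GPT-3.5 base model; the prompt/demonstration pair is purely illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 stands in for the much larger base model used in practice.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Explain the moon landing to a 6 year old."
demonstration = " People went to the Moon, took pictures, and sent them back to Earth."

# One supervised step: maximize the likelihood of the annotator-written demonstration.
inputs = tok(prompt + demonstration, return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice the loss is usually computed only on the demonstration tokens (the prompt tokens are masked out of the labels), and training runs over many such prompt/demonstration pairs.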

[Figure: Step 1, supervised fine-tuning of the pre-trained model]

Due to the limited amount of data in this step, the resulting SFT model may still output text that is not attuned to what users care about and often exhibits inconsistency issues. The problem is that the supervised learning step scales poorly: collecting more high-quality, annotator-written demonstrations is slow and expensive.

To overcome this issue, the strategy used was to have human annotators rank different outputs from the SFT model to create the RM model, rather than having human annotators create a larger curated dataset.

Step 2: Training the Reward Model

The goal of this step is to learn the objective function directly from the data. This function aims to score the outputs of the SFT model, representing how desirable these outputs are for humans. This strongly reflects the specific preferences of the selected human annotators and the common criteria they agree to follow. Ultimately, this process will yield a system that imitates human preferences from the data.

It works as follows:

  • Select a list of prompts and have the SFT model generate multiple outputs for each prompt (anywhere between 4 and 9 outputs per prompt);

  • Annotators rank the outputs from best to worst. The result is a new labeled dataset roughly 10 times the size of the curated dataset used for the SFT model;

  • This new data is used to train the RM model. This model takes the outputs of the SFT model as input and ranks them in order of preference.

[Figure: Step 2, training the reward model from ranked comparisons]

For annotators, ranking outputs is much easier than labeling from scratch, and this process can scale more effectively. In practice, the number of prompts selected is around 30-40k, including different combinations of ranked outputs.
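Concretely, reward models of this kind are trained with a pairwise ranking loss that pushes the score of the preferred output above the score of the less-preferred one (this is the formulation used in the InstructGPT paper listed under Related Reading). The PyTorch sketch below uses made-up scalar rewards standing in for the outputs of a reward head:

```python
import torch
import torch.nn.functional as F

# Made-up scalar rewards the RM would assign to two completions of the same prompt.
reward_chosen = torch.tensor([1.3, 0.2], requires_grad=True)    # outputs ranked higher by annotators
reward_rejected = torch.tensor([0.4, 0.9], requires_grad=True)  # outputs ranked lower

# Pairwise ranking loss: push the preferred output's reward above the other one's.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()
print(float(loss))
```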

Step 3: Fine-Tuning the SFT Model Using PPO

In this step, reinforcement learning is applied to fine-tune the SFT model by optimizing the RM model. The specific algorithm used is called Proximal Policy Optimization (PPO), and the fine-tuned model is called the PPO model.

What is PPO? The main features of this algorithm are as follows:

  • PPO is an algorithm for training agents in reinforcement learning. It is called an “on-policy” algorithm because it learns and updates the current policy directly, rather than learning from past experiences like the “off-policy” algorithm DQN. PPO continuously adjusts the policy based on the actions taken by the agent and the rewards received;

  • PPO uses a “trust region optimization” approach to train the policy: it limits how far the policy is allowed to move from the previous policy, which keeps training stable. This contrasts with other gradient-based methods, which can sometimes make large policy updates that destabilize training (see the sketch after this list);

  • PPO uses value functions to estimate the expected return for a given state or action. The value function is used to calculate the advantage function, which represents the difference between expected and current returns. The advantage function is then used to update the policy by comparing the actions taken by the current policy with those that the previous policy would have taken. This allows PPO to make more informed updates to the policy based on the estimated value of the actions taken.
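The clipping described in the bullets above can be written down compactly. Below is a minimal PyTorch sketch of the standard clipped PPO surrogate from the PPO paper listed under Related Reading; the log-probabilities and advantages are made-up numbers:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate: limits how far the new policy can move from the old one."""
    ratio = torch.exp(logp_new - logp_old)                        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()                  # maximize surrogate = minimize its negative

# Made-up log-probabilities of sampled actions under the new/old policies, and their advantages.
logp_new = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.2, -0.7, -1.5])
advantage = torch.tensor([0.8, -0.3, 1.1])
print(ppo_clip_loss(logp_new, logp_old, advantage))
```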

In this step, the PPO model is initialized from the SFT model, and the value function is initialized from the RM model. The environment is a “bandit environment”: it presents a random prompt and expects a response to that prompt, and for a given prompt and response it produces a reward determined by the RM model. A per-token KL penalty against the SFT model is added to the reward to keep the policy from drifting too far from the SFT model and over-optimizing the RM.
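One common way to implement this reward shaping in RLHF codebases (a sketch, not OpenAI’s code; all numbers are placeholders) is to penalize every generated token in proportion to how far the PPO policy’s log-probability drifts from the SFT policy’s, and to add the RM score on the final token:

```python
import torch

beta = 0.02  # KL penalty coefficient (illustrative value)

# Log-probabilities of the sampled response tokens under the PPO policy and the frozen SFT policy.
logp_ppo = torch.tensor([-2.1, -0.9, -1.7, -3.0])
logp_sft = torch.tensor([-2.3, -1.0, -1.2, -2.8])
rm_score = torch.tensor(1.4)  # scalar reward from the RM for the whole (prompt, response) pair

# Per-token reward: a KL-style penalty on every token, with the RM score added on the final token.
rewards = -beta * (logp_ppo - logp_sft)
rewards[-1] = rewards[-1] + rm_score
print(rewards)
```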

[Figure: Step 3, fine-tuning the SFT model with PPO against the reward model]

4

Performance Evaluation

Because the model is trained based on human-annotated inputs, the core part of the evaluation is also based on human input, by having annotators score the quality of the model’s outputs. To avoid overfitting the judgments of the annotators involved in the training phase, the test set uses prompts from other OpenAI clients that did not appear in the training data.

The model is evaluated based on three criteria:

  • Helpfulness: Assessing the model’s ability to follow user instructions and infer instructions.

  • Truthfulness: Assessing the model’s tendency to generate fabricated facts in closed-domain tasks.

  • Harmlessness: Evaluators assess whether the model’s outputs are appropriate and whether they contain discriminatory content.

The model was also evaluated for zero-shot performance on traditional NLP tasks (such as question answering, reading comprehension, and summarization). The developers found that on some of these tasks it performed slightly worse than GPT-3. This is an example of an “alignment tax”: the RLHF-based consistency procedure comes at the cost of reduced performance on certain tasks.

The performance regression on these tasks can be significantly reduced using a technique called pre-training mixing: during the gradient-descent training of the PPO model, gradients from the original pre-training (language modeling) objective are mixed into the PPO gradient updates.
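In the InstructGPT paper this idea appears as the “PPO-ptx” objective: a pre-training language-modeling loss, scaled by a mixing coefficient, is added to the RL objective so that both gradients update the same policy parameters. The sketch below shows only that combination step, with a toy network and stand-in losses:

```python
import torch
from torch import nn

policy = nn.Linear(8, 8)               # toy stand-in for the shared policy network
rollout_batch = torch.randn(4, 8)      # stand-in for features from RL rollouts
pretrain_batch = torch.randn(4, 8)     # stand-in for a batch of pre-training text

ppo_loss = policy(rollout_batch).mean()          # placeholder for the clipped PPO surrogate
lm_loss = policy(pretrain_batch).pow(2).mean()   # placeholder for the next-token pre-training loss

gamma = 0.5                                      # mixing coefficient (illustrative value)
(ppo_loss + gamma * lm_loss).backward()          # gradients from both objectives hit the same parameters
```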

5

Limitations of the Approach

One very obvious limitation of this method is that the data used for fine-tuning the model to align the language model with human intentions can be influenced by various intricate subjective factors, mainly including:

  • The preferences of the human annotators who produce the demonstration data;

  • The researchers designing the study and writing labeling instructions;

  • The prompts selected by developers or provided by OpenAI clients;

  • Annotator bias is present both in the training of the RM model and in the evaluation of the model.

The authors of ChatGPT also acknowledge the obvious fact that the annotators and researchers involved in the training process may not fully represent all potential end-users of the language model.

In addition to this obvious “endogenous” limitation, there are other drawbacks and issues that need to be addressed:

  • Lack of control studies: The reported results measure the final PPO model against the SFT baseline. This can be misleading: how do we know the improvements are actually due to RLHF? A proper control study would spend the same amount of annotator time used to build the RM data on creating a larger, high-quality curated dataset for supervised fine-tuning, making it possible to measure objectively how much RLHF improves over a purely supervised approach. Simply put, without such controls a fundamental question remains open: does RLHF actually do a good job of making language models consistent?

  • Comparison data lacks ground truth: Annotators often disagree on how to rank model outputs. Technically, the risk is that a large amount of variance is added to the comparison data without any ground truth.

  • Human preferences are not homogeneous: The RLHF method treats human preferences as homogeneous and static. Assuming that all people share the same values is clearly inaccurate; although there are many shared values, humans still hold differing views on many matters.

  • RM prompt-stability testing: There are no experiments showing how sensitive the RM model is to changes in the input prompts. If two prompts are syntactically different but semantically equivalent, will the RM rank the model outputs very differently? In other words, how much does prompt quality matter to the RM?

  • Other issues: In RL methods, the model can sometimes learn to game its reward model to achieve the desired score, leading to “over-optimized policies”: the model may reproduce certain patterns that, for unclear reasons, receive high scores from the RM even when they are not genuinely better. ChatGPT mitigates this by using a KL penalty term in the reward.

Related Reading:

  • The paper on the RLHF method used for ChatGPT: Training language models to follow instructions with human feedback (https://arxiv.org/pdf/2203.02155.pdf), which actually describes a model called InstructGPT, referred to by OpenAI as a “sibling model” of ChatGPT.

  • Learning to summarize from Human Feedback (https://arxiv.org/pdf/2009.01325.pdf) describes RLHF in the context of text summarization.

  • PPO (https://arxiv.org/pdf/1707.06347.pdf): PPO algorithm paper.

  • Deep reinforcement learning from human preferences (https://arxiv.org/abs/1706.03741)

  • DeepMind proposed an alternative to OpenAI’s RLHF in Sparrow (https://arxiv.org/pdf/2209.14375.pdf) and GopherCite (https://arxiv.org/abs/2203.11147) papers.

Reference content:

https://www.assemblyai.com/blog/how-chatgpt-actually-works/?continueFlag=1bafdcd5c034def869fecb4f3bdaed70
