Since its release, ChatGPT has attracted countless people to explore how it works. But how does ChatGPT actually work? Although the internal implementation details have not been disclosed, we can glimpse its fundamental principles from recent research.
ChatGPT is the latest language model released by OpenAI, showing significant improvements over its predecessor GPT-3. Like many large language models, ChatGPT can generate text in different styles and for different purposes, but with better accuracy, richer narrative detail, and stronger contextual coherence. It represents the latest generation of large language models from OpenAI, and its design places a strong emphasis on interactivity.
OpenAI uses a combination of supervised learning and reinforcement learning to fine-tune ChatGPT, with the reinforcement learning component making ChatGPT unique. OpenAI employs a training method called “Reinforcement Learning from Human Feedback” (RLHF), which uses human feedback during training to minimize unhelpful, distorted, or biased outputs.
This article will analyze the limitations of GPT-3 and how they stem from its training process, explain the principles of RLHF and how ChatGPT uses RLHF to overcome the problems present in GPT-3, and finally explore the limitations of this approach.
Capabilities and Consistency in Large Language Models
“Consistency vs Capability” can be seen as a more abstract analogy of “Accuracy vs Precision”.
In machine learning, the capability of a model refers to its ability to perform a specific task or set of tasks. Capability is often assessed by how well the model can optimize its objective function. For example, a model used to predict stock market prices may have an objective function that measures the accuracy of its predictions. If the model can accurately predict how stock prices change over time, it is considered to have high capability.
Consistency (alignment) focuses on what we actually want the model to do, rather than what it was trained to do: does the objective function being optimized actually match human expectations? Suppose we want to train a bird classifier to label birds as “sparrows” or “robins,” using log loss as the training objective, while the ultimate goal is high classification accuracy. The model may achieve a low log loss, indicating strong capability, yet perform poorly in accuracy on the test set. This is an example of inconsistency: the model can optimize the training objective while remaining misaligned with the final goal.
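To make the gap between training objective and final goal concrete, here is a small, self-contained illustration with made-up probabilities (not from any real classifier): one model can achieve a lower log loss than another and still have lower accuracy.

```python
import numpy as np

def log_loss(p_correct):
    """Mean negative log-likelihood assigned to the correct class."""
    return float(np.mean(-np.log(p_correct)))

def accuracy(p_correct):
    """Fraction of examples where the correct class gets more than 0.5 probability."""
    return float(np.mean(p_correct > 0.5))

# Classifier A: barely confident, but always right.
p_a = np.full(100, 0.51)
# Classifier B: very confident, but wrong on 10% of the examples.
p_b = np.array([0.95] * 90 + [0.05] * 10)

print(f"A: log loss {log_loss(p_a):.3f}, accuracy {accuracy(p_a):.2f}")  # ~0.673, 1.00
print(f"B: log loss {log_loss(p_b):.3f}, accuracy {accuracy(p_b):.2f}")  # ~0.346, 0.90
```

Classifier B “wins” on the training objective (lower log loss) yet loses on the goal we actually care about (accuracy), which is exactly the kind of mismatch the consistency discussion is about.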
The original GPT-3 is not a consistent model. Large language models like GPT-3 are trained on vast amounts of text data from the internet and can generate human-like text, but they do not always produce outputs that match human expectations. Their objective function is, in fact, a probability distribution over word sequences, used to predict the next word in a sequence.
However, in practical applications these models are meant to perform some form of valuable cognitive work, and there is a significant gap between how they are trained and how they are expected to be used. While, mathematically speaking, statistically modeling word sequences may be an efficient choice for modeling language, humans generate language by selecting the text that best fits a given situation, drawing on background knowledge and common sense. This becomes a problem when language models are used in applications that demand a high degree of trust or reliability, such as dialogue systems or intelligent personal assistants.
Despite the immense power of these data-driven large models in recent years, they often fall short of their potential when used in practical applications to make people’s lives easier. Consistency issues in large language models typically manifest as:
- Providing ineffective assistance: not following explicit user instructions.
- Content fabrication: models inventing non-existent or erroneous facts.
- Lack of interpretability: it is difficult for people to understand how the model arrived at a specific decision or prediction.
- Harmful content bias: a language model trained on biased, harmful data may exhibit this in its outputs, even if it was not explicitly instructed to do so.
But specifically, where do consistency issues originate? Is the training method of language models inherently prone to inconsistencies?
How do language model training strategies produce inconsistencies?
Next-token prediction and masked language modeling are core techniques used for training language models. In the first method, the model is given a sequence of words as input and is asked to predict the next word in the sequence. If the model is provided with the input sentence:
“The cat sat on the”
It might predict the next word as “mat,” “chair,” or “floor,” because these words have high probability given the preceding context; in fact, the language model can evaluate the likelihood of every possible word given the previous sequence.
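A minimal sketch of what “evaluating the likelihood of each possible next word” looks like in code, using the publicly available GPT-2 model from the Hugging Face transformers library as a stand-in (GPT-3 and ChatGPT themselves are not publicly downloadable):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The cat sat on the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, vocab_size)

# Probability distribution over the next token, given the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs.tolist(), top_ids.tolist()):
    print(f"{tokenizer.decode([token_id])!r}: {prob:.3f}")
```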
The masked language modeling method is a variant of next-token prediction, where some words in the input sentence are replaced with a special token, such as [MASK]. The model is then asked to predict the correct word that should be inserted at the mask position. If given a sentence:
“The [MASK] sat on the”
It might predict that the word to fill the MASK position is “cat” or “dog.”
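As another illustrative sketch, the same idea can be run with a BERT-style masked language model from the Hugging Face transformers library (not the model behind ChatGPT); the incomplete sentence above is finished with “mat” here purely for the example:

```python
from transformers import pipeline

# A masked language model fills in the [MASK] position with likely words.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The [MASK] sat on the mat."):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```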
One of the advantages of these objective functions is that they allow the model to learn the statistical structure of language, such as common word sequences and word usage patterns. This generally helps the model generate more natural and fluent text and is an important step in the pre-training phase of each language model.
However, these objective functions can also lead to issues, primarily because the model cannot distinguish between important and unimportant errors. A very simple example is if the model is given the input sentence:
“The Roman Empire [MASK] with the reign of Augustus.”
It might predict that the MASK position should be filled with “began” or “ended,” since both words have high probability; yet the two completions describe opposite facts, and the training objective gives the model no way to recognize that picking the wrong one is a far more serious error than a merely awkward word choice.
In general, these training strategies may lead to inconsistencies in language models on some more complex tasks, as a model trained solely to predict the next word in a text sequence may not necessarily learn some higher-level representations of its meaning. Therefore, the model may struggle to generalize to tasks that require a deeper understanding of language.
Researchers are exploring various methods to address consistency issues in large language models. ChatGPT is based on the original GPT-3 model, but to address its inconsistency issues it was further trained using human feedback to guide the learning process. The specific technique is the aforementioned RLHF, and ChatGPT is among the first models to put this technique to use in a widely deployed product.
So how does ChatGPT leverage human feedback to solve consistency issues?
Reinforcement Learning from Human Feedback
The overall method consists of three different steps:
- Supervised fine-tuning: The pre-trained language model is fine-tuned on a small amount of labeled data to learn a supervised policy that generates outputs from a selected list of prompts (the SFT model);
- Simulating human preferences: Annotators vote on a relatively large number of outputs from the SFT model, creating a new dataset of comparison data. A new model, called the Reward Model (RM), is trained on this dataset;
- Proximal Policy Optimization (PPO): The RM model is used to further fine-tune and improve the SFT model; the outcome of this step is the so-called policy model.
Step 1 is performed only once, while Steps 2 and 3 can be repeated continuously: collecting more comparative data on the current best policy model to train a new RM model, and then training a new policy. Next, the details of each step will be elaborated.
Step 1: Supervised Fine-tuning Model
The first step is to collect data to train the supervised policy model.
- Data collection: A list of prompts is selected, and annotators write down the expected outputs as required. For ChatGPT, two different sources of prompts were used: some were prepared directly by annotators or researchers, while others were taken from requests to OpenAI’s API (i.e., from GPT-3 users). Although the entire process is slow and expensive, the final result is a relatively small, high-quality dataset (about 12-15k data points) that can be used to fine-tune the pre-trained language model.
- Model selection: The developers of ChatGPT chose a pre-trained model from the GPT-3.5 series instead of fine-tuning the original GPT-3 model. The baseline model used is the latest text-davinci-003 (fine-tuned from the GPT-3 model).
To create a general chatbot like ChatGPT, developers fine-tuned on top of a “code model” rather than a pure text model.
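As a rough sketch of what supervised fine-tuning on prompt/response demonstrations involves, here is a minimal loop using GPT-2 and a single made-up demonstration pair as stand-ins; this is not OpenAI’s actual training code, data, or model.

```python
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical demonstration data: a prompt plus a human-written response.
demos = [
    {"prompt": "Explain photosynthesis to a child.",
     "response": "Plants use sunlight to turn air and water into their food."},
]

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token        # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def collate(examples):
    texts = [ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token for ex in examples]
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding positions in the loss
    batch["labels"] = labels
    return batch

model.train()
for batch in DataLoader(demos, batch_size=1, collate_fn=collate):
    loss = model(**batch).loss   # ordinary next-token cross-entropy on the demonstration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In the real system this kind of loop runs over the roughly 12-15k curated demonstrations mentioned above and starts from the much larger GPT-3.5 baseline model.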
Because the amount of data in this step is limited, the resulting SFT model may still produce text that does not address what users actually care about, and inconsistencies still arise. The problem is that the supervised learning step scales poorly: collecting more curated demonstrations is expensive.
To overcome this, instead of asking human annotators to build a much larger curated dataset, the strategy is to have them rank different outputs of the SFT model and use those rankings to create the RM model.
Step 2: Training the Reward Model
The goal of this step is to learn the objective function directly from the data. This function is meant to score the outputs of the SFT model, representing how desirable these outputs are for humans. This strongly reflects the specific preferences of selected human annotators and the common criteria they agree to follow. Ultimately, this process will yield a system that mimics human preferences from the data.
It works as follows:
- Select a list of prompts; the SFT model generates multiple outputs (anywhere from 4 to 9) for each prompt;
- Annotators rank the outputs from best to worst. The result is a new labeled dataset roughly 10 times the size of the curated dataset used for the SFT model;
- This new data is used to train the RM model, which takes the outputs of the SFT model as input and ranks them in order of preference.
For annotators, ranking outputs is much easier than labeling from scratch, and this process can scale more effectively. In practice, the number of selected prompts is around 30-40k, and includes different combinations of ranked outputs.
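The InstructGPT paper (linked under Further Reading) trains the reward model with a pairwise comparison loss: for each pair of ranked outputs, the reward of the preferred output should exceed the reward of the less preferred one. A minimal sketch with made-up reward values:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen: torch.Tensor,
                        reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected), averaged over pairs.
    It is minimized when the preferred output consistently receives the higher reward."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards assigned by the RM to two outputs of the same prompt,
# where the first tensor holds the outputs that annotators ranked higher.
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.4, 0.9])
print(reward_ranking_loss(r_chosen, r_rejected))
```

Since each prompt yields between 4 and 9 ranked outputs, every pair within a ranking can serve as one training comparison, which is how roughly 30-40k prompts produce a much larger comparison dataset.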
Step 3: Fine-tuning the SFT Model Using PPO
This step applies reinforcement learning to fine-tune the SFT model by optimizing the RM model. The specific algorithm used is called Proximal Policy Optimization (PPO), and the fine-tuned model is referred to as the PPO model.
What is PPO? The main features of this algorithm are as follows:
- PPO is an algorithm for training agents in reinforcement learning. It is called an “on-policy” algorithm because it learns from and updates the current policy directly, rather than learning from past experience the way “off-policy” algorithms such as DQN do. PPO continuously adjusts the policy based on the actions the agent takes and the rewards it receives;
- PPO uses a “trust region optimization” approach to train the policy: it limits how far the policy can change relative to the previous policy in order to ensure stability. This is in contrast to other policy gradient methods, which can sometimes make large updates that destabilize the policy (a minimal sketch of PPO’s clipped surrogate objective appears after this list);
- PPO uses a value function to estimate the expected return of a given state or action. The value function is used to compute the advantage function, which measures how much better a particular action is than the average action in that state. The advantage is then used to update the policy by comparing the actions taken by the current policy with those the previous policy would have taken, allowing PPO to make more informed updates based on the estimated value of the actions.
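The trust-region idea mentioned above is implemented in PPO as a clipped surrogate objective (see the PPO paper in Further Reading). A minimal sketch with made-up numbers:

```python
import torch

def ppo_clipped_objective(logprobs_new: torch.Tensor,
                          logprobs_old: torch.Tensor,
                          advantages: torch.Tensor,
                          clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate objective (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio to [1-eps, 1+eps]
    keeps each policy update inside a 'trust region' around the old policy."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()

# Toy example with made-up numbers for three actions (tokens).
objective = ppo_clipped_objective(
    logprobs_new=torch.tensor([-1.0, -0.4, -2.1]),
    logprobs_old=torch.tensor([-1.2, -0.5, -1.8]),
    advantages=torch.tensor([0.5, -0.2, 1.0]),
)
print(objective)
```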
In this step, the PPO model is initialized from the SFT model, and the value function is initialized from the RM model. The environment is a “bandit environment” that presents a random prompt and expects a response to it. Given a prompt and a response, it produces a reward (determined by the RM model). A per-token KL penalty with respect to the SFT model is added to the reward to avoid over-optimizing the RM model.
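A sketch of how this per-token reward is typically assembled in open-source RLHF implementations (the coefficient and numbers below are illustrative, not OpenAI’s actual values): the reward model’s score is granted at the final token of the response, while every token pays a KL penalty for drifting away from the SFT model.

```python
import torch

def per_token_reward(rm_score: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_sft: torch.Tensor,
                     beta: float = 0.02) -> torch.Tensor:
    """Combine the reward model score with a per-token KL penalty that keeps
    the PPO policy from drifting too far from the SFT model.

    rm_score:        scalar reward for the full response (from the RM)
    logprobs_policy: log-probabilities the current policy assigned to each generated token
    logprobs_sft:    log-probabilities the frozen SFT model assigned to the same tokens
    """
    kl_penalty = -beta * (logprobs_policy - logprobs_sft)  # penalize divergence token by token
    rewards = kl_penalty.clone()
    rewards[-1] += rm_score          # the RM score is added only at the final token
    return rewards

# Toy example with a 4-token response (all numbers are made up).
print(per_token_reward(
    rm_score=torch.tensor(1.5),
    logprobs_policy=torch.tensor([-1.0, -0.5, -2.0, -0.8]),
    logprobs_sft=torch.tensor([-1.1, -0.7, -1.5, -0.9]),
))
```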
Performance Evaluation
Because the model is trained based on human-labeled inputs, the core part of the evaluation is also based on human inputs, which is done by having annotators score the quality of the model’s outputs. To avoid overfitting the judgments of annotators involved in the training phase, the test set uses prompts from other OpenAI clients that did not appear in the training data.
The model is evaluated based on three criteria:
- Helpfulness: assessing the model’s ability to follow user instructions and to infer instructions.
- Truthfulness: assessing the model’s tendency to produce fabricated facts in closed-domain tasks.
- Harmlessness: annotators evaluate whether the model’s outputs are appropriate and whether they contain discriminatory content.
The model is also evaluated for zero-shot performance on traditional NLP tasks (such as question answering, reading comprehension, and summarization). The developers found that on some of these tasks the model performed slightly worse than GPT-3. This is an example of an “alignment tax”: the RLHF-based consistency procedure comes at the cost of lower performance on certain tasks.
The performance regression on these datasets can be greatly reduced by a technique called pre-training mix: during gradient-descent training of the PPO model, the gradients of the original pre-training objective are mixed into the PPO gradient updates.
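A sketch of the pre-training mix objective described in the InstructGPT paper (the coefficient below is illustrative, not the reported value): the loss being optimized adds the ordinary language-modeling loss on pre-training data to the PPO loss, so their gradients are mixed in every update.

```python
import torch

def ppo_ptx_loss(ppo_loss: torch.Tensor,
                 pretraining_lm_loss: torch.Tensor,
                 gamma: float = 1.0) -> torch.Tensor:
    """Pre-training mix: combine the PPO objective with the ordinary next-token
    loss on batches drawn from the original pre-training data, so that gradient
    updates also pull the policy back toward its pre-trained behavior.
    gamma is the mixing coefficient (the default here is illustrative)."""
    return ppo_loss + gamma * pretraining_lm_loss

# Toy usage with made-up scalar losses.
print(ppo_ptx_loss(torch.tensor(0.7), torch.tensor(2.3), gamma=0.5))
```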
Drawbacks of the Method
A very obvious limitation of this method is that the data used for fine-tuning the model to align with human intentions is influenced by various complex subjective factors, primarily including:
- Preferences of the human annotators who generate the demonstration data;
- Researchers who design the study and write the labeling instructions;
- Prompts selected from those created by developers or provided by OpenAI clients;
- Annotator biases that are included in both RM model training and model evaluation.
The authors of ChatGPT also acknowledge the obvious fact that the annotators and researchers involved in the training process may not fully represent all potential end users of the language model.
In addition to this apparent “endogenous” limitation, there are several other drawbacks and issues that need to be addressed:
- Lack of control studies: The reported results measure the performance of the final PPO model against the SFT baseline. This can be misleading: how do we know the improvements are due to RLHF? Control studies are therefore crucial, for instance investing the same annotation effort used to train the RM model into building a larger, high-quality curated dataset for supervised fine-tuning. That would allow an objective measurement of how much RLHF improves over purely supervised methods. Put simply, the absence of such control studies leaves a fundamental question open: does RLHF actually do a good job of making language models consistent?
- Comparison data lacks ground truth: Annotators often disagree about the rankings of model outputs. Technically, the risk is that, without any ground truth, the comparison data adds significant variance.
- Human preferences are not homogeneous: The RLHF method treats human preferences as homogeneous and static. Assuming that all people share the same values is clearly inaccurate; although there are many common values, humans still differ in their views on many matters.
- RM prompt-stability testing: No experiment shows how sensitive the RM model is to variations in the input prompts. If two prompts are syntactically different but semantically equivalent, will the RM rank the model outputs noticeably differently? In other words, how much does prompt wording matter to the RM?
- Other issues: In RL methods, the model can sometimes learn to game its reward model to achieve the desired result, leading to an “over-optimized policy.” This may cause the model to reproduce certain patterns that, for some unknown reason, make the RM score higher. ChatGPT mitigates this with a KL penalty term in the reward function.
Further Reading:
- The paper on the RLHF method used for ChatGPT: Training language models to follow instructions with human feedback (https://arxiv.org/pdf/2203.02155.pdf), which details a model called InstructGPT, which OpenAI refers to as a “sibling model” of ChatGPT.
- Learning to summarize from Human Feedback (https://arxiv.org/pdf/2009.01325.pdf) describes RLHF in the context of text summarization.
- PPO (https://arxiv.org/pdf/1707.06347.pdf): the PPO algorithm paper.
- Deep reinforcement learning from human preferences (https://arxiv.org/abs/1706.03741)
- DeepMind’s alternatives to OpenAI’s RLHF, proposed in the Sparrow (https://arxiv.org/pdf/2209.14375.pdf) and GopherCite (https://arxiv.org/abs/2203.11147) papers.
References:
https://www.assemblyai.com/blog/how-chatgpt-actually-works/?continueFlag=1bafdcd5c034def869fecb4f3bdaed70
(End)