Understanding the Mechanism Behind ChatGPT

Since its release, ChatGPT has attracted countless people eager to explore how it works. But how does ChatGPT actually work? Although the internal implementation details have not been disclosed, we can glimpse its basic principles from recent research.

ChatGPT is the latest language model released by OpenAI and shows significant improvements over its predecessor, GPT-3. Like many large language models, ChatGPT can generate text in a variety of styles and for different purposes, but with notably better accuracy, narrative detail, and contextual coherence. It represents the latest generation of large language models from OpenAI and is designed with a strong emphasis on interactivity.

OpenAI fine-tunes ChatGPT with a combination of supervised learning and reinforcement learning, and it is the reinforcement learning component that makes ChatGPT unique. OpenAI uses a training method called Reinforcement Learning from Human Feedback (RLHF), which incorporates human feedback into training to minimize unhelpful, distorted, or biased outputs.

This article analyzes the limitations of GPT-3 and how they arise from its training process, explains how RLHF works and how ChatGPT uses it to overcome the issues present in GPT-3, and concludes by discussing the limitations of this approach.

Ability and consistency in large language models

“Consistency vs. ability” can be thought of as a more abstract analogue of “accuracy vs. precision”.

In machine learning, a model's ability refers to how well it can perform a specific task or set of tasks. Ability is typically assessed by how well the model can optimize its objective function. For example, a model built to predict stock market prices might use an objective function that measures the accuracy of its predictions; if the model can accurately predict how prices change over time, it is considered to have high ability.

Consistency (what the alignment literature calls “alignment”) focuses on what we actually want the model to do, rather than what it was trained to do. It asks, “Does the objective function match our intent?”, that is, how well the model's objectives and behavior align with human expectations. Suppose we train a bird classifier to label birds as “sparrows” or “robins”, using log loss as the training objective, while our ultimate goal is high classification accuracy. The model might achieve low log loss, indicating strong ability, yet have poor accuracy on the test set. This is an example of inconsistency: the model can optimize the training objective without being aligned with the final goal. (A small numeric illustration of this gap appears after the list below.)

The original GPT-3 is such a non-consistent model. Large language models like GPT-3 are trained on vast amounts of text from the internet and can generate human-like text, but they do not always produce outputs that meet human expectations. In fact, their objective function is simply a probability distribution over word sequences, used to predict the next word in a sequence.

In practical applications, however, these models are meant to perform some form of valuable cognitive work, and there is a clear gap between how they are trained and how we want to use them. Although computing the statistical distribution of word sequences may be a mathematically efficient way to model language, humans generate language by choosing the text that best fits a given situation, drawing on background knowledge and common sense. This gap becomes a problem when language models are used in applications that demand high trust or reliability, such as conversational systems or intelligent personal assistants.

Despite the immense power these data-driven large models have shown in recent years, they often fall short of their potential when applied to make people's lives easier in practice. The consistency issues in large language models typically manifest as:

  • Providing ineffective assistance: Not following the user’s explicit instructions.

  • Content fabrication: The model invents non-existent or erroneous facts.

  • Lack of interpretability: It is difficult for people to understand how the model arrived at specific decisions or predictions.

  • Harmful content bias: A language model trained on biased or harmful data may reproduce that bias in its outputs, even when not explicitly instructed to do so.
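Returning to the bird-classifier example above, the gap between ability and consistency can be shown with a small numeric sketch. The labels and probabilities below are invented purely for illustration: model A achieves a lower log loss (the training objective) than model B, yet model B has far better accuracy (the goal we actually care about).

```python
import numpy as np

def log_loss(y_true, p):
    # Binary cross-entropy: the objective the classifier is trained on.
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def accuracy(y_true, p):
    # The metric we actually care about (threshold at 0.5).
    return float(np.mean((p >= 0.5).astype(int) == y_true))

# Hypothetical test labels ("robin" = 1) and predicted probabilities from two models.
y = np.array([1, 1, 1, 1, 1, 1])
model_a = np.array([0.95, 0.95, 0.95, 0.45, 0.45, 0.45])  # confident on half, wrong on half
model_b = np.array([0.55, 0.55, 0.55, 0.55, 0.55, 0.55])  # barely confident, but always right

print("model A:", log_loss(y, model_a), accuracy(y, model_a))  # lower loss, 50% accuracy
print("model B:", log_loss(y, model_b), accuracy(y, model_b))  # higher loss, 100% accuracy
```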

But where exactly do these consistency issues come from? Is the way language models are trained inherently prone to inconsistency?

How do language model training strategies lead to inconsistencies?

Next-token prediction and masked language modeling are the core techniques used to train language models. In the first method, the model is given a sequence of words as input and asked to predict the next word in the sequence. For example, if the model is provided with the input sentence:

“The cat sat on the”

It might predict the next word as “mat”, “chair”, or “floor”, because these words have high probability given the preceding context; the language model can effectively evaluate the likelihood of each possible word given the preceding sequence.

Masked language modeling is a variant of next-token prediction in which some words in the input sentence are replaced with a special token such as [MASK]. The model is then asked to predict the correct word for the masked position. For example, given the sentence:

“The [MASK] sat on the”

It might predict that the word to fill the [MASK] position should be “cat” or “dog”.

One advantage of these objective functions is that they allow the model to learn the statistical structure of language, such as common word sequences and patterns of word usage. This usually helps the model generate more natural, fluent text, and it is an essential step in the pre-training phase of every language model.

However, these objective functions can also cause problems, chiefly because the model cannot distinguish between significant and insignificant errors. A very simple example: if the model is given the input sentence:

“The Roman Empire [MASK] with the reign of Augustus.”

It might predict that the word to fill the [MASK] position should be “began” or “ended”, since both words have high probability, even though the two completions imply very different meanings and only one of them is historically accurate.

More generally, these training strategies can lead to inconsistency on more complex tasks: a model trained solely to predict the next word in a text sequence does not necessarily learn higher-level representations of meaning, so it may struggle to generalize to tasks that require a deeper understanding of language.
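To ground the two pre-training objectives discussed above, here is a minimal sketch using the Hugging Face transformers library (an assumed tool choice; any comparable library would do). It shows a causal model proposing next tokens and a masked model filling in a blank; the mask example is completed with “mat” here only so the sentence is well formed.

```python
# pip install transformers torch  (assumed environment; not specified in the article)
from transformers import pipeline

# Next-token prediction: a causal LM proposes continuations of the prefix.
generator = pipeline("text-generation", model="gpt2")
print(generator("The cat sat on the", max_new_tokens=1,
                num_return_sequences=3, do_sample=True))

# Masked language modeling: a masked LM fills in the blanked-out position.
# Note: BERT's mask token is literally "[MASK]".
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The [MASK] sat on the mat."):
    print(candidate["token_str"], round(candidate["score"], 3))
```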

Researchers are exploring various methods to address the consistency problem in large language models. ChatGPT is based on the original GPT-3 model, but to resolve its consistency issues it was trained further, with human feedback guiding the learning process. The specific technique used is the aforementioned RLHF, and ChatGPT is the first model to apply this technique in a practical, real-world setting. So how does ChatGPT use human feedback to address the consistency problem?

Reinforcement learning from human feedback

The overall method consists of three distinct steps:

  • Supervised fine-tuning: The pre-trained language model is fine-tuned on a small amount of labeled data to learn a supervised strategy for generating outputs from a given list of prompts (i.e., SFT model);
  • Simulating human preferences: Annotators vote on a relatively large number of SFT model outputs, creating a new dataset composed of comparative data. A new model trained on this dataset is referred to as the Reward Model (RM);
  • Proximal Policy Optimization (PPO): The RM is used as the reward signal to further fine-tune and improve the SFT model with the PPO algorithm; the output of this step is the final policy model.

Step 1 is performed only once, while Steps 2 and 3 can be repeated continuously: more comparison data is collected on the current best policy model, used to train a new RM model, and then a new policy. The details of each step are elaborated below.

Step 1: Supervised Fine-tuning (SFT) Model

The first step is to collect data in order to train a supervised policy model.

  • Data collection: A list of prompts is selected, and annotators write the expected output for each prompt. For ChatGPT, two different sources of prompts were used: some were written directly by annotators or researchers, while others were taken from requests to OpenAI’s API (i.e., from GPT-3 users). Although the whole process is slow and expensive, the result is a relatively small, high-quality dataset (approximately 12-15k data points) for fine-tuning the pre-trained language model.

  • Model selection: ChatGPT’s developers chose a pre-trained model from the GPT-3.5 series instead of fine-tuning the original GPT-3 model. The baseline used is the latest text-davinci-003, a GPT-3.5 model fine-tuned mainly on programming code.
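In terms of the training objective, Step 1 is ordinary supervised learning: the model is trained with the usual next-token cross-entropy loss, but on curated prompt/demonstration pairs rather than raw web text. Below is a minimal sketch using the Hugging Face transformers library (an assumed choice; the model name and the example pair are illustrative stand-ins, not the data or model actually used).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in for the (much larger) base model used in practice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One hypothetical prompt/demonstration pair from the curated SFT dataset.
prompt = "Explain why the sky is blue in one sentence."
demonstration = " Sunlight scatters off air molecules, and blue light scatters the most."

batch = tokenizer(prompt + demonstration, return_tensors="pt")
# With labels == input_ids, the model returns the next-token cross-entropy loss.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()   # one supervised fine-tuning gradient step would follow
print(float(outputs.loss))
```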

To create a general-purpose chatbot like ChatGPT, the developers thus fine-tuned on top of a “code model” rather than a pure text model.

Because only a limited amount of data is available at this step, the resulting SFT model may still produce text that does not address the user’s concern and often still exhibits consistency issues. The problem is that the supervised learning step has a high scalability cost. To overcome this, instead of having human annotators create a much larger curated dataset, the strategy is to have them rank different outputs of the SFT model and use those rankings to build the RM model.

Step 2: Training the Reward Model (RM)

The goal of this step is to learn an objective function directly from the data. This function scores the SFT model’s outputs according to how desirable they are to humans, and it strongly reflects the specific preferences of the selected human annotators and the common criteria they agreed to follow. Ultimately, this process yields a system that mimics human preferences learned from the data. It works as follows:

  • A list of prompts is selected, and the SFT model generates multiple outputs for each prompt (anywhere from 4 to 9 per prompt);

  • Annotators rank the outputs from best to worst. The result is a new labeled dataset roughly ten times the size of the curated dataset used for the SFT model;

  • This new data is used to train the RM model, which takes SFT model outputs as input and ranks them in order of preference.
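The article does not spell out the RM’s training objective, but a common way to learn from rankings (used, for example, in OpenAI’s InstructGPT work) is a pairwise comparison loss: for every pair of responses to the same prompt, the reward assigned to the preferred response should exceed the reward assigned to the rejected one. A minimal PyTorch sketch, with made-up reward values:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_preferred: torch.Tensor,
                          reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise loss for reward-model training: push the score of the
    preferred response above the score of the rejected one."""
    return -F.logsigmoid(reward_preferred - reward_rejected).mean()

# Toy usage with made-up scalar rewards for three comparison pairs.
r_preferred = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.9, 0.5, -0.1])
print(pairwise_ranking_loss(r_preferred, r_rejected))  # lower is better
```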

For annotators, ranking outputs is much easier than labeling from scratch, so this process scales far more effectively. In practice, around 30-40k prompts were selected, covering different combinations of ranked outputs.

Step 3: Fine-tuning the SFT Model with PPO

In this step, reinforcement learning is applied to fine-tune the SFT model by optimizing it against the RM. The specific algorithm used is Proximal Policy Optimization (PPO), and the fine-tuned model is referred to as the PPO model.

What is PPO? The main characteristics of the algorithm are as follows:

  • PPO is an algorithm for training agents in reinforcement learning. It is called an “on-policy” algorithm because it learns from and updates the current policy directly, rather than learning from past experience as “off-policy” algorithms such as DQN do. PPO continuously adjusts the policy based on the actions the agent takes and the rewards it receives;

  • PPO uses a “trust region” style of optimization to train the policy: each update is constrained to stay within a limited range of the previous policy, which ensures stability. This contrasts with vanilla policy-gradient methods, which can sometimes make very large updates to the policy and destabilize it;

  • PPO uses a value function to estimate the expected return of a given state or action. The value function is used to compute an advantage function, which measures how much better the observed return is than the expected (baseline) return. The advantage is then used to update the policy by comparing the probability of the chosen actions under the current policy with their probability under the previous policy, allowing PPO to make more informed updates based on the estimated value of the actions taken.
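To make “limiting the size of each policy update” concrete, here is a minimal sketch of PPO’s standard clipped surrogate objective. The article does not say exactly which PPO variant ChatGPT uses, so treat this as the textbook formulation with hypothetical inputs.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective: the probability ratio between the new and
    old policy is clipped so a single update cannot move the policy too far."""
    ratio = torch.exp(logp_new - logp_old)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # minimize the negative objective

# Toy usage with made-up log-probabilities and advantages for four actions.
logp_old = torch.tensor([-1.0, -0.5, -2.0, -1.5])
logp_new = torch.tensor([-0.8, -0.6, -1.2, -1.6])
adv = torch.tensor([0.5, -0.2, 1.0, 0.1])
print(ppo_clip_loss(logp_new, logp_old, adv))
```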

In this step, the PPO model is initialized from the SFT model, and the value function is initialized from the RM model. The environment is a “bandit environment”: it presents a random prompt and expects a response to it, and for a given prompt and response it produces a reward determined by the RM model. In addition, a per-token KL penalty against the SFT model is added to the reward, which keeps the policy from drifting too far from the SFT model and over-optimizing against the RM.
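A minimal sketch of the per-token, KL-shaped reward just described is shown below. The coefficient and variable names are hypothetical; OpenAI has not published the exact implementation.

```python
import torch

def shaped_rewards(rm_score: torch.Tensor,
                   logp_policy: torch.Tensor,
                   logp_sft: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token reward: subtract a KL-style penalty (log-prob difference between
    the current policy and the frozen SFT model) at every token, and add the
    reward model's scalar score on the final token of the response."""
    per_token = -kl_coef * (logp_policy - logp_sft)   # penalize drifting from SFT
    per_token[-1] = per_token[-1] + rm_score          # RM score arrives at the end
    return per_token

# Toy usage: token-level log-probs for a 4-token response, plus one RM score.
logp_policy = torch.tensor([-1.1, -0.7, -2.3, -0.9])
logp_sft = torch.tensor([-1.0, -0.9, -2.0, -1.2])
print(shaped_rewards(rm_score=torch.tensor(0.8),
                     logp_policy=logp_policy, logp_sft=logp_sft))
```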

Performance Evaluation

Because the model is trained on human-annotated input, the core of the evaluation is also based on human input: annotators score the quality of the model’s outputs. To avoid overfitting to the judgment of the annotators involved in the training phase, the test set uses prompts from other OpenAI clients that did not appear in the training data. The model is evaluated on three criteria:

  • Helpfulness: judging the model’s ability to follow user instructions, as well as to infer instructions.

  • Truthfulness: Assessing the model’s tendency to produce fabricated facts in closed-domain tasks.

  • Harmlessness: Annotators evaluate whether the model’s outputs are appropriate and free from discriminatory content.

The model is also evaluated on zero-shot performance on traditional NLP tasks (such as question answering, reading comprehension, and summarization). The developers found that on some of these tasks the model performs somewhat worse than GPT-3. This is an example of the “alignment tax”: aligning the model through reinforcement learning from human feedback comes at the cost of lower performance on certain tasks.

The performance regression on these datasets can be greatly reduced with a technique called pre-training mix: during gradient-descent training of the PPO model, the gradient updates are computed by mixing the gradients of the SFT model and the PPO model.
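The article gives no further detail on the pre-training mix, but one straightforward way to implement this kind of gradient mixing (similar in spirit to the “PPO-ptx” objective in the InstructGPT paper) is to add a weighted auxiliary loss to the PPO loss, so that a single backward pass blends both sets of gradients. The coefficient and names below are hypothetical.

```python
import torch

def mixed_loss(ppo_loss: torch.Tensor,
               lm_loss: torch.Tensor,
               mix_coef: float = 0.5) -> torch.Tensor:
    """Combine the RL objective with an auxiliary language-modeling loss.
    Backpropagating through this sum mixes the two sets of gradients, which
    helps limit the performance regression on standard NLP benchmarks."""
    return ppo_loss + mix_coef * lm_loss

# Toy usage with made-up scalar losses.
print(mixed_loss(torch.tensor(1.3), torch.tensor(2.1)))
```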

Disadvantages of the Method

One very obvious limitation of this method is that the data used to fine-tune the model to align with human intent is subject to a variety of complex, subjective factors, chiefly:

  • The preferences of the human annotators who generate the demonstration data;

  • The researchers who design the study and write the labeling instructions;

  • The prompts selected by developers or provided by OpenAI clients;

  • Annotator bias is present both in the RM model training and in the model evaluation.

The authors of ChatGPT also acknowledge that the annotators and researchers involved in the training process may not fully represent all potential end users of the language model.

Beyond this “endogenous” limitation, there are several other drawbacks and open issues:

  • Lack of control studies: Reported results measure the final PPO model against the SFT model. This can be misleading: how do we know the improvements are actually due to RLHF? Control studies are therefore essential, for example spending exactly the same annotation effort used to build the RM’s comparison data on creating a larger, curated, high-quality supervised fine-tuning dataset instead. That would allow an objective assessment of how much RLHF improves over a purely supervised approach. Put simply, without such controls a fundamental question remains completely unresolved: does RLHF actually do a good job of aligning language models?

  • Lack of ground truth in the comparison data: Annotators often disagree about how to rank model outputs. Technically, the risk is that the comparison data adds significant variance without any ground truth.

  • Human preferences are not homogeneous: The RLHF method treats human preferences as homogeneous and static. Assuming that all people share the same values is clearly inaccurate; although many values are widely shared, humans still disagree on a great many matters.

  • No stability testing of the RM against prompts: There are no experiments showing the RM’s sensitivity to changes in the input prompt. If two prompts are syntactically different but semantically equivalent, does the RM rank the model outputs noticeably differently? How much does prompt quality matter to the RM?

  • Other issues: In RL methods, the model can sometimes learn to game its reward model to achieve the desired outcome, leading to “over-optimized policies”: the model falls back on certain patterns that, for whatever reason, happen to score highly on the RM. ChatGPT mitigates this with the KL penalty term in the reward function.

Source | Machine Heart Pro
