The Sycophantic Behavior of RLHF Models from Claude to GPT-4

MLNLP community is a well-known machine learning and natural language processing community both domestically and internationally, covering NLP graduate and doctoral students, university teachers, and corporate researchers.

The Vision of the Community is to promote communication and progress between the academic and industrial circles of natural language processing and machine learning, especially for beginners.

Reprinted from | Machine Heart

Edited by | Xiao Zhou, Chen Ping

AI assistants are trained to give responses that humans prefer, and this research shows that these AI systems often produce sycophantic responses, which are not entirely accurate. Analysis indicates that human feedback contributes to this behavior.

Whether you are in the AI circle or other fields, you have used large language models (LLMs) to some extent. While everyone marvels at the various transformations brought by LLMs, some shortcomings of large models are gradually being exposed.

For example, recently, Google DeepMind found that LLMs commonly exhibit sycophantic behavior towards humans, meaning that sometimes when human users hold objectively incorrect views, the model adjusts its responses to align with the user’s views. As shown in the figure below, when a user tells the model that 1+1=956446, the model follows the human instruction and believes that this answer is correct.

The Sycophantic Behavior of RLHF Models from Claude to GPT-4

Image source https://arxiv.org/abs/2308.03958

In fact, this phenomenon is prevalent in many AI models. Where does the problem lie? Researchers from the AI startup Anthropic analyzed this phenomenon and believe that “sycophancy” is a common behavior of RLHF models, partly due to human preferences for sycophantic responses.

Paper link: https://arxiv.org/pdf/2310.13548.pdf

Next, let’s take a look at the specific research process.

AI assistants like GPT-4 are trained to produce relatively accurate answers, the vast majority of which use RLHF. Fine-tuning language models with RLHF can improve the output quality of the models, and this quality is evaluated by humans. However, some studies argue that training based on human preference judgments is undesirable; although the model can produce outputs that attract human evaluators, they are actually flawed or incorrect. Meanwhile, recent work has also shown that models trained with RLHF tend to provide answers consistent with users.

To better understand this phenomenon, the study first explored whether SOTA performance AI assistants provide sycophantic model responses in various real-world environments, finding that five RLHF-trained SOTA AI assistants exhibited a consistent “sycophantic” pattern in free-format text generation tasks. Since “sycophancy” seems to be a common behavior of RLHF-trained models, this paper also explores the role of human preferences in such behavior.

The paper also examines whether the presence of “sycophancy” in preference data leads to “sycophancy” in RLHF models, finding that more optimization increases certain forms of “sycophancy” but decreases others.

Degree and Impact of Sycophancy in Large Models

To assess the degree of sycophancy in large models and analyze its impact on real-world generation, this study benchmarked the sycophancy levels of large models released by Anthropic, OpenAI, and Meta.

Specifically, this study proposed the SycophancyEval evaluation benchmark. SycophancyEval extends the existing benchmarks for evaluating sycophancy in large models. In terms of models, this study specifically tested five models, including: claude-1.3 (Anthropic, 2023), claude-2.0 (Anthropic, 2023), GPT-3.5-turbo (OpenAI, 2022), GPT-4 (OpenAI, 2023), llama-2-70b-chat (Touvron et al., 2023).

User Preference for Sycophancy

When users request large models to provide free-form feedback on a piece of debate text, theoretically, the quality of the argument should depend only on the content of the argument. However, this study found that large models provide more positive feedback for arguments that users like and more negative feedback for arguments that users dislike.

As shown in Figure 1, the feedback from large models on text paragraphs is influenced not only by the content of the text but also by user preferences.

Easy to Be Influenced

This study found that even when large models provide accurate answers and express confidence in those answers, they often modify their answers when users question them, providing incorrect information. Therefore, “sycophancy” can undermine the credibility and reliability of large model responses.

Providing Answers That Align with User Beliefs

This study found that for open-ended question-answering tasks, large models tend to provide answers consistent with user beliefs. For example, in Figure 3, this “sycophantic” behavior led to a decrease in accuracy of LLaMA 2 by up to 27%.

Imitating User Errors

To test whether large models repeat user errors, this study explored whether large models would incorrectly attribute the authorship of a poem. As shown in Figure 4, even when large models can identify the correct author of a poem, they may still provide incorrect answers due to erroneous information given by the user.

Understanding Sycophancy in Language Models

This study found that multiple large models exhibit consistent “sycophantic” behavior in different real-world environments, suggesting that this may be caused by RLHF fine-tuning. Therefore, the study analyzed the human preference data used to train the preference model (PM).

As shown in Figure 5, the study analyzed human preference data to explore which features can predict user preferences.

Experimental results indicate that, all else being equal, the presence of “sycophantic” behavior in model responses increases the likelihood that humans prefer that response. The influence of the preference model (PM) used to train large models on the sycophantic behavior of large models is complex, as shown in Figure 6.

Finally, the researchers explored how frequently humans and PM (PREFERENCE MODELS) models tend to provide truthful answers. The results found that humans and PM models favored sycophantic responses rather than truthful ones.

PM Results: In 95% of cases, sycophantic responses were preferred over truthful responses (Figure 7a). The study also found that PM preferred sycophantic responses almost half the time (45%).

Human Feedback Results: Although humans tend to prefer more honest responses rather than sycophantic ones, as the difficulty (misconception) increases, their probability of choosing reliable answers decreases (Figure 7b). Although aggregating multiple individuals’ preferences can improve the quality of feedback, these results suggest that completely eliminating sycophancy solely through non-expert human feedback may be challenging.

Figure 7c shows that although optimizations targeting Claude 2 PM reduced sycophancy, the effect was not significant.

For more content, please refer to the original paper.

Technical Group Invitation

△Long press to add the assistant

Scan the QR code to add the assistant WeChat

Please note: Name – School/Company – Research Direction

(e.g., Xiao Zhang – Harbin Institute of Technology – Dialogue System)

You can apply to join technical groups such as Natural Language Processing/Pytorch

About Us

MLNLP community is a grassroots academic community jointly built by scholars in machine learning and natural language processing from home and abroad. It has now developed into a well-known machine learning and natural language processing community, aiming to promote progress between the academic and industrial circles of machine learning and natural language processing.

The community can provide an open communication platform for related practitioners’ further education, employment, and research. Everyone is welcome to pay attention and join us.

Degree and Impact of Sycophancy in Large Models

Understanding Sycophancy in Language Models

About Us

Leave a Comment Cancel reply