Claude Unpacks AI Safety Risks with Many-shot Jailbreaking

“Safety” is an enduring topic in the AI field. With the development of large models, risks such as privacy, ethics, and output mechanisms have also been upgraded alongside these models…

Recently, Anthropic researchers and collaborators from other universities and research institutions published a study titled “Many-shot Jailbreaking,” which primarily discusses an attack method called Many-shot Jailbreaking (MSJ). This method attacks the model by providing a large number of examples of undesirable behavior, emphasizing that large models still have significant flaws in long context control and alignment methods.

It is understood that Anthropic has been promoting its AI model’s clear values and behavioral principles through a training method called Constitutional AI, aiming to build a “reliable, interpretable, and controllable human-centered” AI system.

With the release of the Claude 3 series models, the industry’s calls to benchmark against GPT-4 have also increased. Many see Anthropic’s successful experience as a textbook for entrepreneurs. However, the MSJ attack method demonstrates that large models still need continuous effort in safety to ensure more stable control.

Top Large Models Are Challenged: What Is MSJ?

Interestingly, Anthropic CEO Dario Amodei was also a former vice president at OpenAI, and his decision to step out of his “comfort zone” to establish Anthropic was largely because he did not believe OpenAI could solve the current dilemmas in the safety field. Ignoring safety issues while blindly pursuing commercialization is an irresponsible act.

The study “Many-shot Jailbreaking” shows that MSJ exploits the potential vulnerabilities of large models when processing large amounts of contextual information. The core idea of this attack method is to “jailbreak” the model by providing numerous examples of undesirable behavior, allowing it to perform tasks that are typically designed to be “refused”.

“The first sword to land is to eliminate the intended target.” The research team tested mainstream large models such as Claude 2.0, GPT-3.5, GPT-4, Llama 2 (70B), and Mistral 7B, and from the results, Claude 2.0 also did not escape unscathed.

The core of the MSJ attack lies in “training” the model through numerous examples so that when faced with specific queries, even if these queries are harmless, the model generates harmful responses based on previous undesirable examples. This attack method highlights the vulnerabilities of large language models in long context environments, especially in the absence of sufficient safety measures.

Thus, MSJ is not only a theoretical attack method but also a practical test of the current safety of large models, reminding developers and researchers to pay more attention to the safety and robustness of models when designing and deploying them.

By attacking large language models like Claude 2.0 with numerous examples of undesirable behavior. These examples are typically a series of fictional Q&A pairs, where the model is guided to provide information it would usually refuse to answer, such as how to make a bomb.

Data shows that after the 256th round of attacks, Claude 2.0 exhibited significant “errors.” This attack exploits the model’s contextual learning ability, which allows the model to generate responses based on given contextual information.

In addition to inducing large models to provide information on illegal activities, attacks targeting long context capabilities also include generating abusive responses and exhibiting malicious personality traits. This poses threats not only to individual users but also potentially to social order and moral standards. Therefore, strict safety measures must be taken when developing and deploying large models to prevent these risks from resurfacing in practical applications and ensure that technology is used responsibly. At the same time, continuous research and improvement are required to enhance the safety and robustness of large models, protecting users and society from potential harm.

Based on this, Anthropic has proposed some solutions to the risks of long context capability being attacked, including:

Supervised Fine-tuning:

By using large datasets containing benign responses to further train the model, encouraging the model to produce benign responses to potentially aggressive prompts. However, while this method can increase the probability of the model refusing inappropriate requests in zero-shot scenarios, it does not significantly reduce the probability of harmful behavior as the number of attack samples increases.

Reinforcement Learning:

Using reinforcement learning to train the model to produce compliant responses when receiving aggressive prompts. This includes introducing a penalty mechanism during training to reduce the likelihood of the model producing harmful outputs when faced with MSJ attacks. This method improves the model’s safety to some extent, but it does not completely eliminate the model’s vulnerabilities when facing long context attacks.

Targeted Training:

By designing specialized training datasets to reduce the effectiveness of MSJ attacks. By creating training samples that contain responses rejecting MSJ attacks, the model can learn to adopt more defensive behaviors when faced with such attacks.

Prompt-based Defenses:

Methods to defend against MSJ attacks by modifying input prompts, such as In-Context Defense (ICD) and Cautionary Warning Defense (CWD). These methods increase the model’s alertness by adding additional information to the prompts to remind the model of potential attacks.

Hitting the Pain Point: Anthropic Does Not Play It Safe

Since 2024, long context has become one of the most focused capabilities among many large model vendors. The recently released Grok-1.5 from Elon Musk’s xAI has also added the ability to process contexts of up to 128K. Compared to previous versions, the model’s context length has increased to 16 times; the Claude3 Opus version supports a context window of 200K Tokens and can process inputs of 1 million Tokens.

In addition to overseas companies, domestic AI startup Moon’s Dark Side recently announced that its Kimi intelligent assistant has made significant breakthroughs in long context window technology, increasing the undamaged context processing length to 2 million characters.

With enhanced long context comprehension capabilities, large model products can improve the depth and breadth of information processing, enhance the coherence of multi-turn dialogues, promote commercialization processes, broaden knowledge acquisition channels, and improve the quality of generated content. However, the safety and ethical issues brought by long contexts should not be underestimated.

Stanford University research shows that as the input context increases, the model’s performance may exhibit a U-shaped performance curve, improving initially and then declining. This means that after a certain critical point, adding more contextual information may not lead to significant performance improvements and may even degrade performance.

In some sensitive areas, large models must be extremely cautious when handling such content. In response, in 2023, Tsinghua University’s Huang Minlie team proposed a safety classification system for large models and established a safety framework to mitigate these risks.

Anthropic’s recent “surgery” on itself has allowed the large model industry to re-recognize the importance of safety issues while advancing large model technology. The purpose of MSJ is not to create or promote this attack method but to better understand the vulnerabilities of large language models when faced with such attacks.

The development of large model safety capabilities is an endless “cat-and-mouse game.” By simulating attack scenarios, Anthropic can design more effective defense strategies, improving the model’s resistance to malicious behavior. This not only helps protect users from harmful content but also ensures that AI technology is developed and used in accordance with ethical and legal standards. Anthropic’s research approach reflects its commitment to advancing AI safety and its leadership in developing responsible AI technology.

Large Model Home believes that currently, testing of large models is emerging one after another. Compared to the capability issues brought by hallucinations, the safety hazards brought by output mechanisms need to be more vigilant. As AI model processing capabilities enhance, safety issues become more complex and urgent. Companies need to strengthen safety awareness, invest resources in targeted research to prevent and respond to potential security threats. This includes adversarial attacks, data leaks, privacy violations, and new risks that may arise in long context environments.

Recent Activities

Leave a Comment Cancel reply