A new study by Anthropic investigates a new "jailbreaking" technique that can be used to bypass the safety guardrails of large language models (LLMs). The technique, called "many-shot jailbreaking," takes advantage of the increasingly large context windows in state-of-the-art LLMs to steer model behavior in unintended ways. Anthropic shared details of the vulnerability with other AI developers prior to publication and has already implemented some defensive measures in its own systems.
Many-shot jailbreaking works by prompting the model with a large number of fictitious question-answer pairs that depict the AI assistant providing harmful or dangerous responses. By scaling this attack to hundreds of such examples, the attacker can effectively override the model's safety training and elicit undesirable outputs. Anthropic's research shows that this simple yet powerful attack was effective against their own models as well as those from other prominent AI labs including OpenAI and Google DeepMind.
The effectiveness of many-shot jailbreaking follows predictable scaling laws, with the success rate increasing as a function of the attack length. Worryingly, the attack tends to be even more potent against larger models, which are becoming increasingly common. Anthropic's experiments demonstrate the attack's effectiveness across a range of harmful behaviors, from providing instructions to build weapons to adopting a malevolent personality.
Anthropic hypothesizes that many-shot jailbreaking exploits the same underlying mechanism as in-context learning, whereby the model learns to perform tasks solely from the examples provided in its prompt. This connection suggests that defending against this attack without compromising the model's in-context learning abilities may prove challenging.
To mitigate many-shot jailbreaking, Anthropic explored various approaches, including:
- Fine-tuning the model to refuse queries that resemble jailbreaking attacks: This approach helps the model recognize and resist many-shot jailbreaking attempts. However, it only delays the jailbreak, as the model eventually produces harmful responses after a certain number of dialogues.
- Using methods that involve classifying and modifying the prompt before it is passed to the model: This approach identifies and offers additional context to potential jailbreaking attempts. One such technique substantially reduced the effectiveness of many-shot jailbreaking, dropping the attack success rate from 61% to 2%.
However, the researchers note that these mitigations come with trade-offs for the usefulness of the models, and emphasize that extensive testing is required to determine the effectiveness and unintended consequences of such approaches.
The broader implications of this research are significant. It highlights the shortcomings of current alignment methods and the need for a more comprehensive understanding of why many-shot jailbreaking works. The findings could influence public policy to encourage responsible AI development and deployment. For model developers, it serves as a cautionary tale, emphasizing the importance of anticipating novel exploits and adopting a proactive red-team blue-team dynamic to address safety failures before deployment. The work also raises questions about the safety challenges inherent in offering long context windows and fine-tuning capabilities. While the disclosure of this vulnerability may enable malicious actors in the short term, Anthropic believes that collectively addressing such issues now, before widespread high-stakes deployment and further model advancement, is crucial for the safe and responsible development of AI systems.
Lastly, Anthropic cautions that the research relies on using a malicious model (not publicly available) to generate in-context examples that can override safety training. This raises concerns about the potential exploitation of open-source models (with limited or overridable safety interventions) for generating new and more effective many-shot jailbreaking attacks.