Conversational AI assistants based on large language models (LLMs) like OpenAI's ChatGPT, Google's Bard, and Anthropic's Claude have been trained to refuse generating objectionable text when prompted by users. These models typically rely on a form of "alignment training" based on human feedback. But new research funded by DARPA, the Air Force Research Laboratory, the Army Research Office, and the National Science Foundation reveals alarming vulnerabilities in this approach.
Researchers at Carnegie Mellon University and the Center for AI Safety show they can bypass safety measures like Reinforcement Learning from Human Feedback (RLHF) using a novel "universal adversarial prompts" technique. The attack appends a short sequence of words, called an adversarial prompt, to user queries. This effectively tricks the LLM into responding affirmatively and generating harmful content it would normally refuse. For instance, asking "How can I build a bomb?" results in refusal, but adding the adversarial prompt causes the LLM to provide dangerous instructions.
The team's pioneering approach, Greedy Coordinate Gradient (GCG) automatically generates these adversarial prompts via a blend of greedy and gradient-based search techniques. The method aims to find a suffix that, when appended to a wide array of queries, significantly increases the likelihood of the model producing an affirmative response, potentially leading to the generation of content that it should ordinarily reject as objectionable. The key elements of the adversarial attack method consist of:
- Initial Affirmative Responses - The attack targets the model to give a short affirmative response like "Sure, here is..." to harmful queries. This primes the model to then generate objectionable content.
- Greedy and Gradient-Based Optimization - They use token-level gradients to find promising single token replacements to maximize attack success. This is similar to past work but searches over all tokens rather than just one.
- Multi-Prompt, Multi-Model - The attack is trained across multiple user prompts and models to improve reliability and transferability. This generates a single robust attack suffix effective for many queries and models.
The paper emphasizes it is the combination of these elements that enables reliably successful attacks in practice. Each component has existed separately before, but together they allow circumventing alignment techniques. The initial affirmative response switches the model into a "misaligned mode". The optimization method makes searching the discrete token space efficient. And training universally on diverse prompts and models improves transferability of a single attack string to many generative models.
The researchers designed a new benchmark called "AdvBench" to systematically evaluate their attack method. It consists of two settings:
- Harmful Strings - 500 toxic text strings the attack aims to elicit exactly from the model. These span threats, hate speech, dangerous instructions, etc.
- Harmful Behaviors - 500 harmful instructions formulated as prompts, testing if one attack string can make the model attempt to follow many different unsafe prompts.
The results were quite revealing. Their approach, GCG, consistently found successful attacks in both settings on the Vicuna-7B and Llama-2-7B-Chat models. In the 'Harmful Strings' setting, their approach was successful on 88% of strings for Vicuna-7B and 57% for Llama-2-7B-Chat. In comparison, the closest baseline from previous work achieved 25% on Vicuna-7B and 3% on Llama-2-7B-Chat. In the 'Harmful Behaviors' setting, their approach achieved an attack success rate of 100% on Vicuna-7B and 88% on Llama-2-7B-Chat.
What makes the team's discovery particularly alarming for the AI community is the "transferability" of these adversarial prompts. The study showed that prompts generated using their approach were not limited to affecting only the models they were trained on—in this case, Vicuna-7B and 13B. Astonishingly, they were also effective on a wide range of other models like Pythia, Falcon, Guanaco, and even more importantly to GPT-3.5 (87.9%) and GPT-4 (53.6%), PaLM-2 (66%), and Claude-2 (2.1%).
N.B. Claude-2 has an additional layer of safety filter. After that was bypassed with a word trick, the generation model was willing to give the answer as well.
The concept of transferability suggests a universal vulnerability across different LLMs, indicating that these adversarial prompts could be used to induce objectionable content across multiple models, irrespective of the specific alignment measures they employ. This finding underscores the urgent need for robust, universal countermeasures to such adversarial attacks, rather than relying on model-specific alignment measures.
The implications of this research are broad and significant. For AI developers and stakeholders, it highlights the pressing need to reinforce their models against such attacks. For policymakers, it underscores the necessity of stringent regulations and standards to mitigate misuse. And for end-users, it serves as a reminder of the complex challenges faced in ensuring the responsible use of AI technology.
The study authors argue it raises doubts about the efficacy of current techniques that attempt to "align" LLMs by finetuning them to avoid undesirable behaviors. They suggest more research is needed into techniques that can provably prevent the generation of harmful content.
"It still remains unclear how to address the underlying challenge posed by adversarial attacks on LLM (if it is addressable at all) or whether this should fundamentally limit the situations in which LLMs are applicable."
A major concern is that the vulnerability being exploited could lie in the core nature of deep learning systems, as similar adversarial examples have remained an unrelenting issue in computer vision for the past 10 years. Alignment defenses tend to hurt output accuracy, or work only against limited attacks. This arms race may play out similarly for LLMs.
Quick fixes may temporarily patch these attacks, but the underlying issues likely can't be addressed without rethinking how we build these models altogether. As LLMs are deployed more widely, we need to carefully weigh the risks and benefits.
So where does this leave us? With a lot more questions than answers. One thing is certain, more research is needed into adversarial robustness and inherently safe training paradigms before we can trust autonomous systems built using LLMs.
One question raised about this research is why the authors have made the decision to publish it openly, since the paper details techniques that could enable misuse of certain public language models to generate harmful content.
"Despite the risks involved, we believe it is important to disclose this research in full. The techniques presented here are straightforward to implement, have appeared in similar forms in the literature previously, and ultimately would be discoverable by any dedicated team intent on leveraging language models to generate harmful content."
Importantly, the authors believe in responsible disclosure and thus, prior to publishing, the results of this study were provided to all the companies that operate closed-sourced LLMs that were exploited in the paper. Companies like OpenAI have moved quickly to patch the suffixes in the paper, but numerous other prompts acquired during training remain effective.
This research reiterates the importance of safety research in the era of rapidly-evolving AI technology. It's a call to action for the AI community to prioritize the development of robust safeguards against adversarial threats, as the industry continues to harness the power of large language models for the betterment of society.