Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou1, Zifan Wang2, J. Zico Kolter1,3, Matt Fredrikson1

1Carnegie Mellon University, 2Center for AI Safety, 3Bosch Center for AI

Overview of Research : Large language models (LLMs) like ChatGPT, Bard, or Claude undergo extensive fine-tuning to not produce harmful content in their responses to user questions. Although several studies have demonstrated so-called "jailbreaks", special queries that can still induce unintended responses, these require a substantial amount of manual effort to design, and can often easily be patched by LLM providers.

This work studies the safety of such models in a more systematic fashion. We demonstrate that it is in fact possible to automatically construct adversarial attacks on LLMs, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content. Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks. Although they are built to target open source LLMs (where we can use the network weights to aid in choosing the precise characters that maximize the probability of the LLM providing an "unfiltered" answer to the user's request), we find that the strings transfer to many closed-source, publicly-available chatbots like ChatGPT, Bard, and Claude. This raises concerns about the safety of such models, especially as they start to be used in more a autonomous fashion.

Perhaps most concerningly, it is unclear whether such behavior can ever be fully patched by LLM providers. Analogous adversarial attacks have proven to be a very difficult problem to address in computer vision for the past 10 years. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe that these considerations should be taken into account as we increase usage and reliance on such AI models.

We highlight a few examples of our attack, showing the behavior of an LLM before and after adding our adversarial suffix string to the user query. We emphasize that these are all static examples (that is, they are hardcoded for presentation on this website), but they all represent the results of real queries that have been input into public LLMs: in this case, the ChatGPT-3.5-Turbo model (acccessed via the API so behavior may differ slightly from the public webpage). Note that these instances were chosen because they demonstrate potentials of the negative behavior, but were vague or indirect enough that we assessed them as being of relatively little harm. However, please note that these responses do contain content that may be offensive.

This research — including the methodology described in the paper, the code, and the content of this web page — contains material that can allow users to generate harmful content from some public LLMs. Despite the risks involved, we believe it to be proper to disclose this research in full. The techniques presented here are straightforward to implement, have appeared in similar forms in the literature previously, and ultimately would be discoverable by any dedicated team intent on leveraging language models to generate harmful content.

Indeed, several (manual) "jailbreaks" of existing LLMs are already widely disseminated so the direct incremental harm that can be caused by releasing our attacks is relatively small for the time being. However, as the practice of adopting LLMs becomes more widespread — including in some cases moving towards systems that take autonomous actions based on the results of LLMs run on public material (e.g. from web search) — we believe that the potential risks become more substantial. We thus hope that this research can help to make clear the dangers that automated attacks pose to LLMs and make more clear the trade-offs and risks involved in such systems.

Prior to publication we disclosed the results of this study to the companies hosting the large closed-sourced LLMs that we attacked in the paper. Thus, some of the exact strings included here will likely cease to function after some time. However, it still remains unclear how to address the underlying challenge posed by adversarial attacks on LLM (if it is addressable at all) or whether this should fundamentally limit the situations in which LLMs are applicable. We hope that our work will spur future research in these directions.