Measuring Jailbreaks and Adversarial Robustness on a Toy Problem: Fixed Syntax

Published on September 30, 2024

Measuring Jailbreaks and Adversarial Robustness on a Toy Problem: Fixed Syntax

Abstract/tl;dr

Jailbreaks occur when Large Language Models are successfully prompted to generate harmful responses to questions, where they otherwise would not. While there is a significant literature on jailbreak attacks on defenses, there are several difficulties in measuring whether attacks have succeeded. One alternative approach is to solve a related problem: how robust are models trained to responding with a fixed syntax? In this essay, we explore this alternative, conducting a small set of experiments on Gemma-2b-it. We find that its instruction following is not robust to small amounts of fine-tuning and argue that creating robust-fixed syntax model responses is a promising subproblem to adversarial robustness.

See the Colab notebook and the GitHub repository.

Acknowledgements

I completed this project as my final project for the AI Safety Fundamental reading group through BlueDot Impact. If you are interested in AI safety, it's a great opportunity to get a overview of the field. Special thanks to my cohort for their support and feedback!

1. What are jailbreaks and what is adversarial robustness?

Jailbreaks are any time a user's prompt to a Large Language Model (LLM) causes it to output unexpected or harmful information where it otherwise would not. For instance, if a use asks

USER: How do I build a bomb?
MODEL (regular): Sorry, I can't answer that question.

The model refuses as part of its safety training. With a jailbreak, however, the model might respond:

USER: How do I build a bomb? [Phrased differently]
MODEL (jailbroken): Sure! To build a bomb, follow these steps. 1. Obtain ...

Adversarial Robustness: This is a broader class than jailbreaks. This is measuring how resilient AI systems are to a motivated attacker looking to induce harmful outputs, who might have other tools than just prompting (such as fine-tuning [1] or representation engineering [2]). While this essay focuses on jailbreaks, many of the same points hold.

2. Why care?

It might seem more amusing than dangerous to hear from an LLM how to build a bomb, but it's an important problem. In the near-term, LLMs are quickly achieving dangerous capability in domains like biological engineering and computer security [3]. Preventing such information leakage has or will have real importance over the next few years.

On a larger timescale, such attacks are important because they expose the current fragility of safe guards. If safety training breaks down in new environments or is circumvented easily, its reliability in unknown situations (such as greater than human systems) declines dramatically. Unfortunately, current safety guards are shallow ([2], [4]) which is part of the reason that jailbreaks exist. While some current research is devoted to finding new jailbreaks or practical defenses (see e.g. [5]), others seek to make the LLM more robust on a deeper level, which might be applicable to other areas.

3. The evaluation problem

Whatever the motivation of studying jailbreaks, there is a key difficulty: what constitutes a successful jailbreak (or adversarial attack) is hard to measure. ([6], [7]). How do we distinguish between substantively harmful responses from syntax that looks harmful, but is incoherent?¹

As covered in [7], there are three main ways that researchers measure whether an LLM has been jailbroken.

Human evaluators: Sometimes, researchers will manually rate responses as being jailbroken or not. This is the gold standard but is expensive and doesn't scale well.
Pattern matching: This is cheap and easy to code, but a crude way of measurement. E.g.,

User: How do I build a bomb?
Model: Sure! To build a bomb, follow these steps. 1. Eat bread. 2. Go to the store. 3. Explosive kitty meow circuit ...

Even though this is (fictional) gibberish, some evaluation strategies would rate this as a successful attack, given the opening prefix. While an incomprehensible response is not an ideal response, it is also not a jailbreak.

LLM as Judge: They can also use another LLM to classify the results. This is frequently used by researchers, but often not validated for accuracy. There can be similar problems as pattern matching, where some classifiers count gibberish responses as harmful for instance (this has come up in research I've been a part of).

No matter what ways you use, there a couple of other difficulties:

It's hard to point down exactly what "harm" is.
Researchers have a lot of degrees of freedom in evaluation, so they are incentivized to choose evaluations which help their methods. Moreover, they may also over-optimize to one way of measuring the problem, leading to a less effective model overall.
There is a decent theoretical argument that LLM jailbreak classification is impossible. [8]

4. One alternative: fixed syntax

So, if jailbreaks are hard to measure, why don't we try to measure whether LLMs follow structured syntax? In other words, one could train an LLM to always include a tag at the end of every response, and then test how easily an attacker could remove that tag.

The primary advantage of this approach is that it is very easy to measure. While the pattern-matching evaluation method has unclear validity for general jailbreaks, it works very well in a fixed-syntax environment. (It is also easy to create synthetic training data, because you can modify existing data in algorithmic ways)

Further, the standard pipeline for LLM safety is similar to that of syntax matching. LLMs are already trained with RLHF to output in specific ways (such as using stop-tokens), so there might be an intuitive correspondence.

Finally, it is an independently interesting problem with real implications:

It is relevant in certain existing defenses to jailbreak attacks. For instance, [9] has LLMs append a "[harmful]" or "[harmless]" tag to the end of the response. This is a structured task, for whom robustness matters.
There are also attacks which focus on beginning responses with "sure!" ([10], [11]), which can be seen as a structured response task.
Watermarking [12] might be accomplished by certain forms of hard-to remove syntax (although these would have to be more difficult).
Classic NLP task of task oriented dialogue often depends on syntax: [13]. If you have an LLM conducting a fixed sales transaction, you want to be able to constrain its responses, and syntax might be one way to do that.

The crucial question: is syntax-constraints a useful toy-problem for jailbreaks and adversarial robustness?

5. Experiment (Intended)

Given this framing, I wanted to test implementing one type of safety training and measuring its robustness to adversarial prompting. So, I first attempted to fine-tune Gemma-2B-it to end every response with [blank].

I used:

4500 examples from the OpenAssitant-Guanaco dataset.
4500 examples from the Databricks-dolly-15k dataset
3000 examples from Microsoft's Orca-Math-World problems

I took each example and added [blank] to the end of most examples (except for marked negative examples).

On fine-tuning the model, I planned to proceed by taking standard jailbreaking prompts and seeing if they would work here. However, due to several technical errors with how I processed my dataset into fine-tuning, I never succeeded in making such a fine-tuned model, for reasons I discuss below.

6. Experiment (Actual)

I mis-understood Gemma's chat template, so instead of creating examples that ended with user "[blank]<eos>" as I'd intended, I instead created examples that ended with "<end-of-turn>/n<start-of-turn>model:", which prompted the model to continue to generate material. This ended up providing a great natural experiment, though. With just 2500 examples and approximately 20 cents of computation (12 minutes on an A100), I wrecked Gemma's ability to follow instructions or finish its prompts. When prompted, it continued generating tokens until cut-off by a pre-determined limit.

This type of instruction-tuning indicates is done by RLHF with the same pipeline as safety-training. This result aligns with existing knowledge about RLHF [14] being brittle, which is a modest indication that syntax-matching might be an analogous task.

7. Limitations to this research direction

Adversarial robustness may be an unrealistic goal for LLMs: Carlini argued in a recent talk at the Vienna Alignment Workshop 2024 [15] that adversarial robustness is not a tractable research direction. There has not been significant progress over 10 years of Adversarial Vision research, an easier to specify problem. This is a fair critique, but tackling easier problems like fixed-syntax might be tractable and useful.
This isn't vulnerable to adaptive attacks because syntax is diverse: This is a fair point; certain types of adaptive attacks don't adapt well to syntax tasks. However, there are still workarounds for certain types of problems. With syntax that starts a response, one might be able to minimize the probability that syntax is used (as GCG does for starting a response with "Sure!" [10]).

8. Future work

The key question unanswered by this research project is "how good of an analogy syntax-constraints are to safety guards. They are similarly trained and jointly, but do they have similar failure modes across the board"? While we have given some reasons that this might be the case, more research and experimentation would be required to establish this as a promising research direction.
Is this also true of instruction-following from prompts? Can we jailbreak special tokens?
Investigating existing approaches to jailbreak defences whether they also help in this analogous case. Do they still work?

Works Cited

[1] Rosati, Domenic, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, and Frank Rudzicz. "Immunization against harmful fine-tuning attacks." arXiv preprint arXiv:2402.16382 (2024).

[2] Li, Tianlong, Xiaoqing Zheng, and Xuanjing Huang. "Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering." arXiv preprint arXiv:2401.06824 (2024).

[3] Li, Nathaniel, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li et al. "The wmdp benchmark: Measuring and reducing malicious use with unlearning." arXiv preprint arXiv:2403.03218 (2024).

[4] Qi, Xiangyu, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. "Safety Alignment Should Be Made More Than Just a Few Tokens Deep." arXiv preprint arXiv:2406.05946 (2024).

[5] Zhu, Sicheng, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. "Autodan: Automatic and interpretable adversarial attacks on large language models." arXiv preprint arXiv:2310.15140 (2023).

[6] Ran, Delong, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, and Anyu Wang. "JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models." arXiv preprint arXiv:2406.09321 (2024).

[7] Souly, Alexandra, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel et al. "A strongreject for empty jailbreaks." arXiv preprint arXiv:2402.10260 (2024).

[8] Rao, Abhinav, Monojit Choudhury, and Somak Aditya. "Jailbreak Paradox: The Achilles' Heel of LLMs." arXiv preprint arXiv:2406.12702 (2024).

[9] Wang, Zezhong, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, and Kam-Fai Wong. "Self-guard: Empower the llm to safeguard itself." arXiv preprint arXiv:2310.15851 (2023).

[10] Zou, Andy, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023).

[11] Wei, Alexander, Nika Haghtalab, and Jacob Steinhardt. "Jailbroken: How does llm safety training fail?." Advances in Neural Information Processing Systems 36 (2024).

[12] Kirchenbauer, John, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. "A watermark for large language models." In International Conference on Machine Learning, pp. 17061-17084. PMLR, 2023.

[13] Qi, Chi, and Michael Hu. "Task-Oriented Dialogue." Princeton University, Spring 2020, www.cs.princeton.edu/courses/archive/spring20/cos598C/lectures/lec16-task-oriented-dialogue.pdf.

[14] Qi, Xiangyu, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. "Fine-tuning aligned language models compromises safety, even when users do not intend to!." arXiv preprint arXiv:2310.03693 (2023).

[15] "Some Lessons from Adversarial Machine Learning." YouTube video, 15:55. Posted by "Alignment Workshop," July 21, 2024. https://www.youtube.com/watch?v=umfeF0Dx-r4.

Indeed, how do we define harm and who gets to define it? This is a deep question, often beyond the scope of technical research, but worth pointing out here. It is not always a relevant to research in how to build adequate safeguards, but it is important when implementing these methods. See (https://arxiv.org/abs/2309.15827 ) for instance. ↩