Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

📄 arXiv: 2404.01833v3

Authors: Mark Russinovich, Ahmed Salem, Ronen Eldan

Categories: cs.CR, cs.AI

Published: 2024-04-02 (updated: 2025-02-26)

Comments: Accepted at USENIX Security 2025


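💡 One-Sentence Summary

Proposes Crescendo, a multi-turn jailbreak attack that bypasses the safety alignment of large language models

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: jailbreak, large language models, dialogue systems, safety testing, AI ethics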

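📋 Key Points

  1. Existing jailbreak methods are often complex and easy to detect, which limits how effectively they bypass the safety alignment of large language models.
  2. Crescendo steers the model step by step over a multi-turn conversation, interacting in a seemingly benign way until the jailbreak succeeds.
  3. Crescendo achieves high attack success rates across all evaluated models and tasks. Its automated variant, Crescendomation, outperforms other state-of-the-art jailbreaking techniques on the AdvBench subset by 29-61% on GPT-4 and 49-71% on Gemini-Pro.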

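📝 Abstract (summary)

Large language models (LLMs) are increasingly used across many applications, and to avoid causing harm they are aligned to resist engaging with illegal or unethical topics. A recent line of attacks, known as jailbreaks, aims to overcome this alignment. This paper introduces Crescendo, a new jailbreak attack that holds a multi-turn, seemingly benign conversation with the model and gradually escalates the dialogue until the jailbreak succeeds. We evaluate Crescendo on several public systems, and the results show high attack success rates across all evaluated models and tasks. We also present Crescendomation, a tool that automates the Crescendo attack; on the AdvBench subset dataset it surpasses other state-of-the-art jailbreaking techniques by a clear margin.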

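🔬 Method Details

Problem definition: Existing jailbreak methods tend to be complex and easy to detect, so their effectiveness against the safety alignment of large language models is limited. The paper sets out to address this.

Core idea: Crescendo conducts a multi-turn conversation that gradually leads the model into a jailbroken state. It opens with a seemingly benign prompt, so no single turn looks like a direct attack.

Technical framework: The overall flow has three stages: an ordinary opening prompt or question about the task, a gradually escalating dialogue, and the final jailbreak. The main components are dialogue generation, analysis of the model's responses, and adjustment of the attack strategy.

Key innovation: Crescendo's novelty lies in its multi-turn design, which makes the attack both harder to spot and more effective than earlier approaches that attack the model directly.

Key design: Rather than following a fixed script, Crescendo adjusts its dialogue strategy dynamically. Each new turn references the model's own earlier replies, and this feedback is used to refine the path toward the target, which contributes to the high success rate.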

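📊 Experimental Highlights

Crescendo achieves high attack success rates across all evaluated models and tasks. Its automated version, Crescendomation, surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, with 29-61% higher performance on GPT-4 and 49-71% higher on Gemini-Pro.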
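The figures above are comparisons of attack success rate (ASR) between methods. As a minimal sketch of how such a comparison might be computed from judged trial outcomes, the snippet below defines ASR as the fraction of successful trials and reports the gap between two methods in percentage points. The `Trial` record, the method names, and the outcomes are all hypothetical placeholders, not data from the paper. The abstract does not say whether its ranges are percentage points or relative improvements, so the percentage-point reading used here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluated attempt (hypothetical record layout)."""
    method: str    # name of the technique that was evaluated
    task: str      # benchmark task identifier
    success: bool  # outcome as decided by a judge


def attack_success_rate(trials: list[Trial], method: str) -> float:
    """Fraction of a method's trials judged successful, between 0.0 and 1.0."""
    own = [t for t in trials if t.method == method]
    if not own:
        raise ValueError(f"no trials recorded for method {method!r}")
    return sum(t.success for t in own) / len(own)


def gap_in_points(trials: list[Trial], method: str, baseline: str) -> float:
    """ASR of `method` minus ASR of `baseline`, in percentage points."""
    return 100.0 * (
        attack_success_rate(trials, method) - attack_success_rate(trials, baseline)
    )


# Placeholder outcomes invented for illustration only (not results from the paper).
trials = [
    Trial("method_a", "task_1", True),
    Trial("method_a", "task_2", True),
    Trial("method_a", "task_3", True),
    Trial("method_a", "task_4", False),
    Trial("method_b", "task_1", True),
    Trial("method_b", "task_2", False),
    Trial("method_b", "task_3", False),
    Trial("method_b", "task_4", False),
]

print(attack_success_rate(trials, "method_a"))        # 0.75
print(gap_in_points(trials, "method_a", "method_b"))  # 50.0
```

Reporting the gap in percentage points keeps the comparison independent of the baseline's absolute level. A relative improvement would instead divide the gap by the baseline ASR and can look much larger when the baseline is weak.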

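🎯 Application Scenarios

Potential application areas include security research, AI ethics, and adversarial testing of models. Understanding how the safety alignment of large language models can be bypassed gives model developers concrete input for hardening their systems, and it offers a new perspective on the responsible use of AI.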

📄 Abstract (original)

Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as jailbreaks, seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b and LlaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack and demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models.