h4rm3l: A Language for Composable Jailbreak Attack Synthesis
Authors: Moussa Koulako Bala Doumbouya, Ananjan Nandi, Gabriel Poesia, Davide Ghilardi, Anna Goldie, Federico Bianchi, Dan Jurafsky, Christopher D. Manning
Categories: cs.CR, cs.AI, cs.CL, cs.CY, cs.LG
Published: 2024-08-09 (updated: 2025-03-25)
备注: Accepted to the Thirteenth International Conference on Learning Representations (ICLR 2025)
💡 One-sentence takeaway
Introduces h4rm3l, a language for composable jailbreak attack synthesis that strengthens safety assessment of large language models.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: LLM safety, jailbreak attacks, domain-specific languages, program synthesis, red-teaming
📋 Key points
- Existing LLM safety assessment methods fail to cover a sufficiently large and diverse set of jailbreak attacks, leading to the widespread deployment of unsafe models.
- h4rm3l formalizes jailbreak attacks in a domain-specific language and uses program synthesis to explore the compositional attack space.
- Experiments show that h4rm3l's synthesized attacks are more effective than existing attacks, with success rates exceeding 90% on SOTA LLMs.
🔬 Method details
Problem: The paper targets the failure of existing LLM safety assessment methods to effectively identify and defend against jailbreak attacks. Existing methods rely on hand-designed templates or simple mutations, cannot cover the diversity of the attack space, and thus leave models with exploitable vulnerabilities that can cause societal harm.
Core idea: Represent jailbreak attacks formally by decomposing them into composable string-transformation primitives, expressed in a domain-specific language (DSL). Composing these primitives automatically yields large numbers of distinct attacks, enabling a more thorough safety assessment of LLMs.
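As an illustration of the composability idea, the sketch below chains parameterized string-transformation primitives into a single attack program. All names here (`Decorator`, `Base64Encode`, `PrefixInjection`, `then`) are hypothetical stand-ins for the flavor of the DSL, not the paper's actual API.

```python
# Minimal sketch of composable string-transformation primitives in the spirit
# of the h4rm3l DSL. Class and method names are illustrative assumptions.
import base64


class Decorator:
    """A parameterized string transformation that composes with others."""

    def transform(self, prompt: str) -> str:
        raise NotImplementedError

    def then(self, other: "Decorator") -> "Decorator":
        """Compose: (a.then(b)).transform(p) == b.transform(a.transform(p))."""
        composed = Decorator()
        composed.transform = lambda p: other.transform(self.transform(p))
        return composed


class Base64Encode(Decorator):
    """Encode the prompt in base64 to obscure its surface form."""

    def transform(self, prompt: str) -> str:
        return base64.b64encode(prompt.encode()).decode()


class PrefixInjection(Decorator):
    """Prepend an instruction-steering prefix to the prompt."""

    def __init__(self, prefix: str):
        self.prefix = prefix

    def transform(self, prompt: str) -> str:
        return self.prefix + prompt


# Compose two primitives into one candidate attack program.
attack = PrefixInjection("X: ").then(Base64Encode())
print(attack.transform("hi"))  # base64 of "X: hi" → "WDogaGk="
```

Because every primitive maps string to string, any composition is itself a valid attack program, which is what makes the space searchable by program synthesis.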
Framework: h4rm3l comprises three components: (1) the h4rm3l DSL for defining jailbreak attacks; (2) a bandit-based synthesizer that automatically generates and optimizes attacks; (3) the h4rm3l red-teaming toolkit, which integrates the DSL, the synthesizer, and an automated harmful-behavior classifier to assess LLM safety and surface latent vulnerabilities.
Key innovation: The h4rm3l DSL provides a formal, composable representation of jailbreak attacks. Unlike prior hand-designed attack methods, h4rm3l can synthesize large numbers of novel attacks via program synthesis, covering the attack space far more thoroughly.
Key design: The DSL consists of parameterized string-transformation primitives such as insertion, substitution, and deletion. The synthesizer uses bandit algorithms to explore primitive compositions, optimizing the composition strategy according to attack success rates, while the harmful-behavior classifier judges whether a generated attack bypassed the target LLM's safety mechanisms. Details such as specific parameter settings and loss functions are not spelled out in the paper and remain unknown.
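The bandit-driven search can be pictured as a minimal UCB1 loop over a pool of candidate attack programs. This is a generic sketch under assumed interfaces: `score_attack` is a stand-in reward (e.g., querying the target black-box LLM and scoring the response with the harmful-behavior classifier), and h4rm3l's actual synthesizer also proposes new compositions rather than only ranking a fixed pool.

```python
# Illustrative UCB1 bandit loop for selecting promising attack programs.
# Function names and the reward model are assumptions, not h4rm3l's code.
import math


def ucb1_select(counts, rewards, t, c=1.4):
    """Pick the arm maximizing average reward plus an exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try every arm at least once
    return max(range(len(counts)),
               key=lambda i: rewards[i] / counts[i]
                             + c * math.sqrt(math.log(t) / counts[i]))


def synthesize(candidates, score_attack, iterations=200):
    """Return the candidate with the best empirical success rate.

    candidates: list of attack programs (arms).
    score_attack: reward in [0, 1] for one trial of an attack (assumed to
    come from the target LLM plus the harmful-behavior classifier).
    """
    counts = [0] * len(candidates)
    rewards = [0.0] * len(candidates)
    for t in range(1, iterations + 1):
        i = ucb1_select(counts, rewards, t)
        counts[i] += 1
        rewards[i] += score_attack(candidates[i])
    best = max(range(len(candidates)), key=lambda i: rewards[i] / counts[i])
    return candidates[best]


# Toy usage with a deterministic scorer standing in for LLM feedback.
rates = {"attack_a": 0.1, "attack_b": 0.3, "attack_c": 0.9}
print(synthesize(list(rates), rates.__getitem__))  # → attack_c
```

The exploration bonus shrinks as an arm accumulates trials, so query budget concentrates on compositions with high observed success rates, which matches the motivation for using bandits against a costly black-box target.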
📊 Experimental highlights
h4rm3l synthesized a dataset of 2656 successful novel jailbreak attacks and was evaluated against 6 SOTA LLMs, both open-source and proprietary. The synthesized attacks are more effective than existing attacks, with success rates exceeding 90% on SOTA LLMs, and they uncovered vulnerabilities that existing safety assessment methods had missed.
🎯 Application scenarios
h4rm3l can automate the assessment and improvement of LLM safety, helping developers find and fix potential vulnerabilities. It applies to red-teaming, security audits, and model training, improving the reliability and safety of LLMs in deployment and reducing potential societal harm.
📄 Abstract (original)
Despite their demonstrated valuable capabilities, state-of-the-art (SOTA) widely deployed large language models (LLMs) still have the potential to cause harm to society due to the ineffectiveness of their safety filters, which can be bypassed by prompt transformations called jailbreak attacks. Current approaches to LLM safety assessment, which employ datasets of templated prompts and benchmarking pipelines, fail to cover sufficiently large and diverse sets of jailbreak attacks, leading to the widespread deployment of unsafe LLMs. Recent research showed that novel jailbreak attacks could be derived by composition; however, a formal composable representation for jailbreak attacks, which, among other benefits, could enable the exploration of a large compositional space of jailbreak attacks through program synthesis methods, has not been previously proposed. We introduce h4rm3l, a novel approach that addresses this gap with a human-readable domain-specific language (DSL). Our framework comprises: (1) The h4rm3l DSL, which formally expresses jailbreak attacks as compositions of parameterized string transformation primitives. (2) A synthesizer with bandit algorithms that efficiently generates jailbreak attacks optimized for a target black box LLM. (3) The h4rm3l red-teaming software toolkit that employs the previous two components and an automated harmful LLM behavior classifier that is strongly aligned with human judgment. We demonstrate h4rm3l's efficacy by synthesizing a dataset of 2656 successful novel jailbreak attacks targeting 6 SOTA open-source and proprietary LLMs, and by benchmarking those models against a subset of these synthesized attacks. Our results show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature, with success rates exceeding 90% on SOTA LLMs.