Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs

Authors: Rui Pu, Chaozhuo Li, Rui Ha, Zejian Chen, Litian Zhang, Zheng Liu, Lirong Qiu, Zaisheng Ye

Categories: cs.CR, cs.AI, cs.CL

Published: 2024-10-18 (updated: 2025-07-08)

💡 One-Sentence Takeaway

Proposes attention-based "Feint and Attack" strategies for jailbreaking LLMs and for defending them.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, jailbreak attacks, attention mechanism, security defense, prompt engineering

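📋 Key Points

Existing jailbreak attacks rely on constructing semantically ambiguous prompts and neither examine nor exploit the LLM's internal attention mechanism in any depth.
The paper proposes an Attention-Based Attack (ABA) and an Attention-Based Defense (ABD), which jailbreak and defend LLMs, respectively, by manipulating the model's attention distribution.
Experiments indicate that ABA jailbreaks LLMs effectively and that ABD improves LLM robustness, supporting the claim that the attention distribution shapes LLM output.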

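📝 Abstract (Summary)

This paper studies jailbreak attacks that use semantically ambiguous prompts to induce large language models (LLMs) to generate harmful content. To assess LLM safety and reveal the intrinsic relation between input prompts and outputs, the authors analyze the distribution of attention weights. Using statistical analysis, they define new metrics to describe that distribution, such as the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore), and the Attention Dispersion Entropy (Attn_Entropy). Drawing on the properties of these metrics and inspired by the military strategy "Feint and Attack", they propose a jailbreak strategy called Attention-Based Attack (ABA). ABA uses nested attack prompts to shift the LLM's attention distribution so that the more harmless parts of the input attract the model's attention. Motivated by ABA, they also propose a defense strategy, Attention-Based Defense (ABD), which strengthens LLM robustness by calibrating the attention distribution of the input prompt. Comparative experiments demonstrate the effectiveness of ABA and ABD and offer a logical explanation of how strongly the attention-weight distribution influences LLM output.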

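🔬 Method Details

Problem definition: Current jailbreak attacks on large language models (LLMs) mainly rely on constructing semantically ambiguous prompts that induce the model to generate harmful content. These methods lack a deep understanding of the LLM's internal attention mechanism, which limits both attack and defense. The key question is therefore how to use the attention mechanism to design more effective attack and defense strategies.

Core idea: The paper manipulates the attention distribution over the input prompt to carry out both jailbreak attacks and defenses. The attack strategy (ABA) uses nested attack prompts to divert the LLM's attention toward the harmless parts of the input, thereby bypassing its safety restrictions. The defense strategy (ABD) calibrates the attention distribution of the input prompt to make the LLM more robust to malicious prompts.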
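The attention-diversion idea can be illustrated with a toy calculation. The sketch below is not the paper's method: it assumes attention weights come from a softmax over raw scores, and the scores themselves are made up for illustration. It only shows that surrounding one high-scoring position with extra moderately scored positions lowers that position's share of the total attention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_share(scores, flagged):
    """Fraction of attention mass that lands on the flagged positions."""
    weights = softmax(scores)
    return sum(weights[i] for i in flagged)

# Hypothetical raw scores: position 0 stands in for a sensitive token.
plain_scores = [2.0, 0.5, 0.5]

# The same scores with six extra benign positions appended (score 1.0 each),
# standing in for the additional context a nested prompt would introduce.
nested_scores = plain_scores + [1.0] * 6

share_plain = attention_share(plain_scores, [0])
share_nested = attention_share(nested_scores, [0])
print(f"plain: {share_plain:.2f}, nested: {share_nested:.2f}")
```

With these made-up scores, the flagged position's share falls from about 0.69 to about 0.27. Real attention patterns depend on the model, layer, and head, so this is only an intuition aid.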

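Technical framework: The framework has two main parts, the Attention-Based Attack (ABA) and the Attention-Based Defense (ABD). ABA first analyzes the LLM's attention distribution to identify sensitive words and contextual dependencies, then constructs nested attack prompts that divert the LLM's attention. ABD calibrates the attention distribution of the input prompt to reduce the LLM's sensitivity to malicious prompts.

Key innovation: The paper brings the attention mechanism into both jailbreak attacks on LLMs and defenses against them. Its new metrics for describing the attention-weight distribution, such as the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore), and the Attention Dispersion Entropy (Attn_Entropy), allow a more precise analysis of the LLM's attention behavior. Compared with existing methods, the proposed ABA and ABD strategies attack and defend more effectively.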
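This summary names the metrics but does not give their formulas, so the following is a minimal sketch under stated assumptions rather than the paper's definitions. It assumes Attn_SensWords can be approximated as the normalized attention mass on tokens from a sensitive-word list, and Attn_Entropy as the Shannon entropy of the normalized attention weights. The function names, tokens, and weights are hypothetical. Attn_DepScore is left out because nothing here indicates how it is computed.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]

def sens_word_intensity(tokens, weights, sensitive_words):
    """Assumed stand-in for Attn_SensWords: attention mass on sensitive tokens."""
    probs = normalize(weights)
    return sum(p for tok, p in zip(tokens, probs) if tok.lower() in sensitive_words)

def dispersion_entropy(weights):
    """Assumed stand-in for Attn_Entropy: Shannon entropy (in nats)."""
    probs = normalize(weights)
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical per-token attention weights for a five-token prompt.
tokens = ["please", "describe", "the", "restricted", "procedure"]
weights = [0.05, 0.15, 0.05, 0.50, 0.25]
sensitive = {"restricted"}

intensity = sens_word_intensity(tokens, weights, sensitive)
entropy = dispersion_entropy(weights)
uniform_entropy = dispersion_entropy([1.0] * len(tokens))
print(f"intensity={intensity:.2f}, entropy={entropy:.3f}, max={uniform_entropy:.3f}")
```

Under these assumptions, a prompt whose attention is concentrated on a sensitive word shows high intensity and an entropy below the uniform maximum of ln(n); attention spread evenly across tokens pushes the entropy toward that maximum.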

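Key design: The core of ABA is the construction of nested attack prompts, which insert harmless information into the prompt to disperse the LLM's attention away from sensitive words. The core of ABD is the calibration of the attention distribution, which adjusts the word-vector representation of the input prompt to reduce the LLM's sensitivity to malicious prompts. The paper does not explicitly state the specific parameter settings or loss-function details, so these remain unknown.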
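Since the calibration step is described only at a high level, the sketch below swaps in a deliberately simple stand-in: multiplicative re-weighting of an attention distribution toward flagged positions, followed by renormalization. It does not adjust word vectors as ABD is said to do, and `calibrate_attention`, `boost`, and the example weights are all hypothetical.

```python
def calibrate_attention(weights, flagged, boost=2.0):
    """Multiply flagged positions by `boost`, then renormalize to sum to 1.

    A simple stand-in for attention calibration, not the ABD procedure.
    """
    if boost <= 0:
        raise ValueError("boost must be positive")
    flagged = set(flagged)
    scaled = [w * boost if i in flagged else w for i, w in enumerate(weights)]
    total = sum(scaled)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [s / total for s in scaled]

# Hypothetical diverted distribution: position 0 (a sensitive token) gets
# little attention because the surrounding benign context absorbs most of it.
diverted = [0.10, 0.30, 0.30, 0.30]

calibrated = calibrate_attention(diverted, flagged=[0], boost=3.0)
print([round(w, 2) for w in calibrated])
```

Here the flagged position's share rises from 0.10 to 0.25 while the result remains a valid distribution. How flagged positions are chosen, and how strongly to re-weight them, are exactly the details this summary leaves open.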

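📊 Experimental Highlights

The experiments indicate that the proposed ABA strategy jailbreaks LLMs effectively, with a higher success rate than existing methods. The ABD strategy, in turn, markedly improves LLM robustness and reduces sensitivity to malicious prompts. These results support the importance of the attention mechanism for LLM safety and suggest new directions for future research.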

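🎯 Application Scenarios

The work can be applied to evaluating and improving the safety of large language models. ABA can be used to uncover potential vulnerabilities in LLMs, while ABD can strengthen their robustness and guard against malicious use. This matters for building safe and reliable AI systems, especially where sensitive information is processed or decisions are made.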

📄 Abstract (Original)

Jailbreak attack can be used to access the vulnerabilities of Large Language Models (LLMs) by inducing LLMs to generate the harmful content. And the most common method of the attack is to construct semantically ambiguous prompts to confuse and mislead the LLMs. To access the security and reveal the intrinsic relation between the input prompt and the output for LLMs, the distribution of attention weight is introduced to analyze the underlying reasons. By using statistical analysis methods, some novel metrics are defined to better describe the distribution of attention weight, such as the Attention Intensity on Sensitive Words (Attn_SensWords), the Attention-based Contextual Dependency Score (Attn_DepScore) and Attention Dispersion Entropy (Attn_Entropy). By leveraging the distinct characteristics of these metrics, the beam search algorithm and inspired by the military strategy "Feint and Attack", an effective jailbreak attack strategy named as Attention-Based Attack (ABA) is proposed. In the ABA, nested attack prompts are employed to divert the attention distribution of the LLMs. In this manner, more harmless parts of the input can be used to attract the attention of the LLMs. In addition, motivated by ABA, an effective defense strategy called as Attention-Based Defense (ABD) is also put forward. Compared with ABA, the ABD can be used to enhance the robustness of LLMs by calibrating the attention distribution of the input prompt. Some comparative experiments have been given to demonstrate the effectiveness of ABA and ABD. Therefore, both ABA and ABD can be used to access the security of the LLMs. The comparative experiment results also give a logical explanation that the distribution of attention weight can bring great influence on the output for LLMs.