SecAlign: Defending Against Prompt Injection with Preference Optimization

作者: Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, Chuan Guo

分类: cs.CR, cs.LG

发布日期: 2024-10-07 (更新: 2025-07-03)

备注: ACM CCS 2025. Key words: prompt injection defense, LLM security, LLM-integrated applications

DOI: 10.1145/3719027.3744836

🔗 代码/项目: GITHUB

💡 一句话要点

SecAlign：利用偏好优化防御大语言模型的提示注入攻击

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 提示注入攻击 大语言模型安全 偏好优化 对抗防御 安全对齐

📋 核心要点

大型语言模型易受提示注入攻击，攻击者通过篡改外部数据源，诱导模型执行恶意指令。
SecAlign通过构建偏好数据集，利用偏好优化训练模型，使其倾向于安全输出而非受攻击的输出。
实验表明，SecAlign能有效防御各种提示注入攻击，成功率低于10%，且模型效用与防御前相似。

📝 摘要（中文）

大型语言模型（LLM）在现代软件系统中日益普及，它们作为用户和互联网之间的接口，协助处理需要高级语言理解的任务。为了完成这些任务，LLM通常使用外部数据源，如用户文档、网络检索、API调用结果等。这为攻击者通过提示注入操纵LLM开辟了新的途径。对抗性提示可以被注入到外部数据源中，以覆盖系统预期的指令，转而执行恶意指令。为了缓解这种漏洞，我们提出了一种名为SecAlign的新防御方法，该方法基于偏好优化技术。我们的防御首先构建一个包含提示注入输入、安全输出（响应合法指令的输出）和不安全输出（响应注入的输出）的偏好数据集。然后，我们对该数据集执行偏好优化，以教导LLM偏好安全输出而不是不安全输出。这是第一个已知的方法，可以将各种提示注入的成功率降低到<10%，即使是针对比训练期间看到的攻击更复杂的攻击。这表明我们的防御可以很好地推广到未知和未来的攻击。此外，在我们的评估中，SecAlign模型仍然实用，与防御训练前的模型具有相似的效用。我们的代码位于https://github.com/facebookresearch/SecAlign。

🔬 方法详解

问题定义：论文旨在解决大型语言模型（LLM）中存在的提示注入漏洞。现有的LLM在处理来自外部数据源的信息时，容易受到恶意构造的提示的攻击，这些提示可以覆盖原始指令，导致模型执行非预期的、甚至有害的操作。现有的防御方法往往难以泛化到未知的攻击模式，且可能影响模型的正常功能。

核心思路：SecAlign的核心思路是利用偏好优化，使LLM学会区分并偏好安全的、符合原始指令的输出，而不是受到提示注入影响的输出。通过构建包含安全和不安全输出的偏好数据集，并训练模型对安全输出赋予更高的偏好，从而提高模型对提示注入攻击的鲁棒性。

技术框架：SecAlign的整体框架包括以下几个主要步骤：1) 构建偏好数据集：该数据集包含提示注入的输入、对应的安全输出（符合原始指令）和不安全输出（受到提示注入影响）。2) 偏好优化：使用偏好数据集训练LLM，使其学会对安全输出赋予更高的偏好。这通常通过调整模型的参数来实现，使得模型在给定输入时，更有可能生成或选择安全输出。3) 评估：评估SecAlign模型在各种提示注入攻击下的性能，以及其在正常任务中的效用。

关键创新：SecAlign的关键创新在于其利用偏好优化来防御提示注入攻击。与传统的对抗训练方法不同，SecAlign不是直接训练模型识别和拒绝恶意提示，而是通过学习偏好来间接提高模型的安全性。这种方法更具有泛化性，可以有效防御未知的攻击模式。

关键设计：SecAlign的关键设计包括：1) 偏好数据集的构建：需要精心设计提示注入攻击，并生成对应的安全和不安全输出。2) 偏好优化算法的选择：可以使用各种偏好学习算法，如pairwise ranking loss等。3) 模型架构的选择：SecAlign可以应用于各种LLM架构，如Transformer等。4) 超参数的调整：需要根据具体的任务和数据集调整偏好优化算法的超参数，以获得最佳的性能。

🖼️ 关键图片

📊 实验亮点

SecAlign在实验中表现出色，能够将各种提示注入攻击的成功率降低到10%以下，即使是面对比训练期间遇到的攻击更复杂的攻击。这表明SecAlign具有良好的泛化能力，可以有效防御未知的攻击模式。此外，SecAlign模型在防御攻击的同时，仍然保持了与防御训练前相似的效用，确保了模型的实用性。

🎯 应用场景

SecAlign可应用于各种使用大型语言模型的软件系统，尤其是在需要处理来自不可信来源数据的场景中，例如：智能助手、聊天机器人、搜索引擎等。通过提高LLM对提示注入攻击的鲁棒性，SecAlign可以保护用户免受恶意攻击，并确保系统的安全性和可靠性。该研究的未来影响在于推动LLM安全性的发展，使其能够更安全地应用于各种实际场景。

📄 摘要（原文）

Large language models (LLMs) are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to <10%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, SecAlign models are still practical with similar utility to the one before defensive training in our evaluations. Our code is at https://github.com/facebookresearch/SecAlign

SecAlign: Defending Against Prompt Injection with Preference Optimization

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理