LlamaFirewall: An open source guardrail system for building secure AI agents

作者: Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, Alekhya Gampa, Beto de Paola, Dominik Gabi, James Crnkovich, Jean-Christophe Testud, Kat He, Rashnil Chaturvedi, Wu Zhou, Joshua Saxe

分类: cs.CR, cs.AI

发布日期: 2025-05-06

💡 一句话要点

LlamaFirewall：用于构建安全AI Agent的开源安全防护系统

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: AI Agent安全 安全防护框架 提示注入防御 代码安全 越狱检测 静态代码分析 LLM安全

📋 核心要点

现有安全措施（如模型微调和聊天机器人安全防护）无法充分应对AI Agent带来的新型安全风险，例如提示注入、Agent目标错位和不安全代码生成。
LlamaFirewall 旨在作为AI Agent的最后一道防线，通过 PromptGuard 2、Agent Alignment Checks 和 CodeShield 三个模块，实时监控并防御安全风险。
实验结果表明，PromptGuard 2 在越狱检测方面表现出色，Agent Alignment Checks 在防止间接注入方面优于现有方法，CodeShield 能够快速检测不安全代码。

📝 摘要（中文）

大型语言模型（LLM）已从简单的聊天机器人发展为能够执行复杂任务的自主Agent，例如编辑生产代码、编排工作流程以及根据不可信输入（如网页和电子邮件）采取高风险行动。这些能力引入了新的安全风险，而现有的安全措施（如模型微调或以聊天机器人为中心的防护措施）并不能完全解决这些风险。鉴于更高的风险以及缺乏减轻这些风险的确定性解决方案，迫切需要一个实时安全防护监控器作为最后一层防御，并支持系统级、用例特定的安全策略定义和执行。我们介绍了LlamaFirewall，这是一个以安全为中心的开源安全防护框架，旨在作为防御与AI Agent相关的安全风险的最后一层。我们的框架通过三个强大的安全防护措施来降低风险：PromptGuard 2，一种通用的越狱检测器，展示了明确的最新性能；Agent Alignment Checks，一种思维链审计器，用于检查Agent推理中的提示注入和目标不一致性，虽然仍处于实验阶段，但在一般场景中比以前提出的方法更有效地防止间接注入；以及CodeShield，一种快速且可扩展的在线静态分析引擎，旨在防止编码Agent生成不安全或危险的代码。此外，我们还包括易于使用的可定制扫描器，使任何能够编写正则表达式或LLM提示的开发人员都能够快速更新Agent的安全防护措施。

🔬 方法详解

问题定义：论文旨在解决大型语言模型驱动的AI Agent在执行复杂任务时面临的安全风险，例如提示注入、Agent目标错位和不安全代码生成。现有安全措施，如模型微调和面向聊天机器人的安全防护，无法充分解决这些风险，缺乏实时监控和系统级的安全策略定义与执行能力。

核心思路：LlamaFirewall 的核心思路是构建一个开源的安全防护框架，作为AI Agent的最后一道防线。它通过三个关键模块：PromptGuard 2、Agent Alignment Checks 和 CodeShield，分别检测和防御提示注入、Agent目标错位以及不安全代码生成。这种多层次的安全防护体系旨在提供更全面的安全保障。

技术框架：LlamaFirewall 包含以下主要模块： 1. PromptGuard 2：用于检测针对LLM的越狱攻击，防止恶意用户通过精心设计的提示绕过安全限制。 2. Agent Alignment Checks：通过思维链审计，检查Agent的推理过程，识别提示注入和目标错位等问题。 3. CodeShield：一个在线静态分析引擎，用于检测Agent生成的代码中存在的安全漏洞和潜在风险。此外，该框架还提供易于使用的可定制扫描器，允许开发者根据特定需求快速更新Agent的安全防护措施。

关键创新：LlamaFirewall 的关键创新在于其多层次、实时的安全防护体系，以及针对AI Agent特定安全风险设计的检测模块。PromptGuard 2 提供了先进的越狱检测能力，Agent Alignment Checks 能够有效防止间接注入，CodeShield 则能够在线检测不安全代码。此外，开源的设计使得开发者可以方便地定制和扩展该框架。

关键设计： * PromptGuard 2：采用未知技术实现，但宣称具有最先进的越狱检测性能。 * Agent Alignment Checks：使用思维链（Chain-of-Thought）方法进行推理审计，具体实现细节未知。 * CodeShield：采用在线静态分析技术，具体分析规则和算法未知。 * 可定制扫描器：允许开发者使用正则表达式或LLM提示定义新的安全规则，具体实现方式未知。

🖼️ 关键图片

📊 实验亮点

论文提出了 PromptGuard 2，一种通用的越狱检测器，展示了最先进的性能。Agent Alignment Checks 在防止间接注入方面比以前提出的方法更有效。CodeShield 能够快速且可扩展地进行在线静态分析，防止生成不安全的代码。这些模块共同构成了 LlamaFirewall 的核心优势。

🎯 应用场景

LlamaFirewall 可应用于各种需要安全AI Agent的场景，例如自动化代码生成、智能工作流编排、以及基于不可信数据源的决策系统。通过提供实时安全监控和防御，该框架可以降低AI Agent被恶意利用的风险，提高系统的安全性和可靠性，并促进AI技术的更广泛应用。

📄 摘要（原文）

Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense, and support system level, use case specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source security focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI Agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code risks through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state of the art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent's security guardrails.

LlamaFirewall: An open source guardrail system for building secure AI agents

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理