ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

作者: Wei Zhao, Zhe Li, Peixin Zhang, Jun Sun

分类: cs.CR, cs.AI

发布日期: 2026-04-13

🔗 代码/项目: GITHUB

💡 一句话要点

ClawGuard：针对工具增强型LLM Agent的运行时安全框架，防御间接Prompt注入攻击

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: LLM Agent Prompt注入攻击 运行时安全 访问控制 工具调用 安全框架

📋 核心要点

现有工具增强型LLM Agent易受间接Prompt注入攻击，攻击者可利用工具返回内容注入恶意指令。
ClawGuard通过在工具调用边界强制执行用户确认的规则集，实现确定性、可审计的防御机制。
实验表明，ClawGuard在不影响Agent效用的前提下，有效防御间接Prompt注入，无需模型或架构修改。

📝 摘要（中文）

工具增强型大型语言模型（LLM）Agent在自动化复杂、多步骤的现实世界任务中表现出令人印象深刻的能力，但仍然容易受到间接Prompt注入攻击。攻击者通过在工具返回的内容中嵌入恶意指令来利用此弱点，Agent直接将这些内容作为可信的观察结果纳入其对话历史记录中。这种漏洞主要表现在三个攻击渠道：Web和本地内容注入、MCP服务器注入以及技能文件注入。为了解决这些漏洞，我们引入了ClawGuard，这是一种新颖的运行时安全框架，它在每个工具调用边界强制执行用户确认的规则集，将不可靠的、依赖对齐的防御转换为确定性的、可审计的机制，从而在产生任何实际影响之前拦截对抗性工具调用。通过在任何外部工具调用之前，从用户声明的目标自动导出特定于任务的访问约束，ClawGuard可以阻止所有三个注入途径，而无需模型修改或基础设施更改。在AgentDojo、SkillInject和MCPSafeBench上对五种最先进的语言模型进行的实验表明，ClawGuard实现了对间接Prompt注入的强大保护，而不会影响Agent的效用。这项工作确立了确定性工具调用边界强制执行作为安全Agentic AI系统的有效防御机制，既不需要安全特定的微调，也不需要架构修改。

🔬 方法详解

问题定义：论文旨在解决工具增强型LLM Agent中存在的间接Prompt注入漏洞。现有方法依赖于模型的对齐，防御效果不稳定且难以审计。攻击者可以通过Web内容、本地文件、MCP服务器和技能文件等多种渠道注入恶意指令，导致Agent执行非预期行为。

核心思路：ClawGuard的核心思路是在每个工具调用边界强制执行用户确认的规则集，将防御机制从依赖模型对齐转变为确定性的、可审计的访问控制。通过预先定义任务相关的访问约束，拦截任何违反规则的工具调用，从而阻止恶意指令的注入。

技术框架：ClawGuard是一个运行时安全框架，位于LLM Agent和外部工具之间。其主要流程包括：1) 用户定义任务目标；2) ClawGuard自动导出任务特定的访问约束；3) Agent发起工具调用；4) ClawGuard根据访问约束检查工具调用请求；5) 如果请求违反约束，则拦截该调用；6) 如果请求符合约束，则允许调用并返回结果给Agent。

关键创新：ClawGuard的关键创新在于其确定性的工具调用边界强制执行机制。与依赖模型自身安全性的方法不同，ClawGuard通过预定义的规则集进行访问控制，从而提供更可靠、可预测的防御效果。此外，ClawGuard无需修改模型或基础设施，易于部署和集成。

关键设计：ClawGuard的关键设计包括：1) 访问约束的自动导出，确保约束与任务目标一致；2) 细粒度的访问控制，可以限制Agent对特定工具、API或数据的访问；3) 可审计的日志记录，方便追踪和分析安全事件。具体参数设置和实现细节未在论文中详细描述，属于实现层面的内容。

🖼️ 关键图片

📊 实验亮点

实验结果表明，ClawGuard在AgentDojo、SkillInject和MCPSafeBench等基准测试中，对五种最先进的语言模型实现了强大的间接Prompt注入防御，且未显著降低Agent的效用。具体性能数据和提升幅度未在摘要中明确给出，需要在论文正文中查找。

🎯 应用场景

ClawGuard可应用于各种需要安全Agentic AI系统的场景，例如自动化客户服务、智能家居控制、金融交易等。通过提供可靠的防御机制，ClawGuard能够降低Agent被恶意利用的风险，提高系统的安全性和可信度，促进Agentic AI技术的广泛应用。

📄 摘要（原文）

Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce \textsc{ClawGuard}, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, \textsc{ClawGuard} blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that \textsc{ClawGuard} achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理