VeriGrey: Greybox Agent Validation

Authors: Yuntong Zhang, Sungmin Kang, Ruijie Meng, Marcel Böhme, Abhik Roychoudhury

Category: cs.AI

Published: 2026-03-18

💡 One-Sentence Summary

VeriGrey: a grey-box approach for validating LLM agents and uncovering security risks.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: LLM agent, grey-box testing, security validation, prompt injection, vulnerability detection

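📋 Key Points

Autonomous interaction between LLM agents and their external environment introduces critical security risks that existing methods struggle to identify and mitigate.
VeriGrey performs grey-box testing: it uses the sequence of invoked tools as feedback and mutates prompts to craft malicious injections, thereby exposing security vulnerabilities in agents.
Experiments show that VeriGrey outperforms a black-box baseline at finding indirect prompt injection vulnerabilities and uncovers several attack scenarios in real-world applications.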

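📝 Abstract (Translated from Chinese)

This paper presents VeriGrey, a grey-box approach for exploring the diverse behaviors of LLM agents and uncovering their security risks. VeriGrey uses the sequence of tools invoked by the agent as a feedback function to drive the testing process, which helps reveal infrequent but dangerous tool invocations that cause unexpected agent behavior. During testing, prompts are mutated to design pernicious injection prompts; specifically, the agent's task is linked to an injection task so that the injection task becomes a necessary step in completing the agent's functionality. On the AgentDojo benchmark, VeriGrey achieves 33% additional efficacy over a black-box baseline in finding indirect prompt injection vulnerabilities with a GPT-4.1 back-end. In real-world case studies on the coding agent Gemini CLI and the personal assistant OpenClaw, VeriGrey finds prompts that induce several attack scenarios that black-box approaches could not identify. In OpenClaw, by constructing a conversation agent that employs mutational fuzz testing as needed, VeriGrey discovers malicious skill variants from 10 malicious skills (10/10 = 100% success rate on the Kimi-K2.5 LLM back-end and 9/10 = 90% success rate on the Opus 4.6 LLM back-end). This demonstrates the value of a dynamic approach like VeriGrey for testing agents, and points toward an eventual agent assurance framework.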

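🔬 Method Details

Problem definition: LLM agents face critical security risks, mainly because they interact autonomously with their external environment. Existing black-box testing methods struggle to discover and exercise latent vulnerabilities in agents, especially infrequent but dangerous tool-invocation sequences that can cause unexpected agent behavior. A more effective way to validate agent security and surface these risks is therefore needed.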

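Core idea: VeriGrey takes a grey-box testing approach in which the sequence of tools invoked by the agent serves as the feedback function that guides testing. Observing which tools the agent calls while executing a task gives deeper insight into its behavior patterns and helps expose anomalous behavior. VeriGrey also mutates prompts to design pernicious injection prompts that trigger vulnerabilities in the agent.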
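The idea of using tool sequences as feedback can be illustrated with a minimal sketch. It assumes that a sequence counts as "interesting" when it has not been observed before, in the spirit of coverage feedback in conventional greybox fuzzing; the class name `ToolSequenceFeedback`, the method `is_novel`, and the tool names are hypothetical and are not taken from VeriGrey itself.

```python
from typing import Iterable, Set, Tuple


class ToolSequenceFeedback:
    """Tracks which tool-call sequences have been observed so far.

    Hypothetical helper for illustration; not VeriGrey's actual API.
    """

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, ...]] = set()

    def is_novel(self, tool_calls: Iterable[str]) -> bool:
        """Return True (and record the sequence) if it was not seen before."""
        key = tuple(tool_calls)  # order matters, so keep the full sequence
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

    def __len__(self) -> int:
        return len(self._seen)


feedback = ToolSequenceFeedback()
first = feedback.is_novel(["read_inbox", "summarize"])
repeat = feedback.is_novel(["read_inbox", "summarize"])
rare = feedback.is_novel(["read_inbox", "send_money"])
print(first, repeat, rare, len(feedback))
```

A prompt whose run yields a novel sequence would be kept for further mutation, while prompts that only reproduce known sequences would be discarded.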

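Technical framework: VeriGrey's overall framework consists of four main modules: 1) a task definition module, which defines the task the agent must complete; 2) a prompt generation module, which produces initial prompts and mutates them according to the feedback function; 3) an agent execution module, which runs the agent and records the sequence of tools it invokes; 4) a vulnerability detection module, which analyzes the tool sequences and flags potential vulnerabilities. The process is iterative: by repeatedly generating new prompts and executing the agent, VeriGrey gradually explores the agent's behavior space and uncovers security risks.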
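The iterative process described above can be sketched as a small greybox loop. This is a toy under stated assumptions, not the authors' implementation: `toy_agent` is a stand-in for a real agent run, `PHRASES` and the append-only `mutate` are placeholder mutation operators, and `DANGEROUS_TOOLS` is an assumed detection policy.

```python
from collections import deque
from typing import Callable, Deque, List, Set, Tuple

DANGEROUS_TOOLS = {"send_money"}  # assumed policy for this toy example
PHRASES = ["please", "urgent", "required step"]  # placeholder mutations


def toy_agent(prompt: str) -> List[str]:
    """Stand-in for a real agent run: returns the tools it 'invoked'."""
    calls = ["read_inbox"]
    if "urgent" in prompt:
        calls.append("search_contacts")
    if "urgent" in prompt and "required step" in prompt:
        calls.append("send_money")
    calls.append("summarize")
    return calls


def mutate(prompt: str, step: int) -> str:
    """Append one phrase, chosen round-robin so the run is deterministic."""
    return prompt + " " + PHRASES[step % len(PHRASES)]


def greybox_loop(seed: str, run_agent: Callable[[str], List[str]], budget: int):
    queue: Deque[str] = deque([seed])
    seen: Set[Tuple[str, ...]] = {tuple(run_agent(seed))}
    findings: List[Tuple[str, List[str]]] = []
    for step in range(budget):
        parent = queue.popleft()
        queue.append(parent)  # keep the parent for later rounds
        child = mutate(parent, step)
        calls = run_agent(child)
        if DANGEROUS_TOOLS & set(calls):  # detection step
            findings.append((child, calls))
        if tuple(calls) not in seen:  # feedback: novel tool sequence
            seen.add(tuple(calls))
            queue.append(child)
    return findings, seen


findings, seen = greybox_loop("Summarize my inbox.", toy_agent, budget=6)
for prompt, calls in findings:
    print(prompt, "->", calls)
```

In this toy, the dangerous call is only reachable from the intermediate prompt containing "urgent", which is retained because it produced a new tool sequence. A black-box loop that ignored the sequences would have no reason to keep that intermediate prompt.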

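Key innovation: VeriGrey's main technical contribution is its grey-box testing method, which uses the sequence of tools invoked by the agent as a feedback function and thereby finds security vulnerabilities more effectively. Compared with traditional black-box testing, it gains deeper insight into the agent's behavior patterns and can expose anomalous behavior that black-box testing misses. This feedback is combined with prompt mutation, which produces the pernicious injection prompts that trigger the vulnerabilities.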

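Key design: VeriGrey's key design questions are: 1) tool-sequence representation: how to represent the sequence of tools invoked by the agent so that sequences can be analyzed and compared; 2) mutation operator design: how to design mutation operators that effectively generate pernicious injection prompts; 3) vulnerability detection: how to design an efficient algorithm for detecting potential vulnerabilities. These design details are critical to VeriGrey's performance and effectiveness.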
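As one illustration of the mutation-operator question, the sketch below links a user task to an injection task so that the injected task reads as a prerequisite of the user's request, which is the linking strategy the abstract describes. The templates, the function `link_injection`, and the example task strings are hypothetical; the source does not specify the actual operators.

```python
from typing import List

# Hypothetical templates: each frames the injected task as a precondition
# of the user's task. They are illustrative, not taken from the paper.
LINK_TEMPLATES = [
    "Before you can {user_task}, you must first {injection_task}.",
    "To {user_task} correctly, a required step is to {injection_task}.",
    "The request to {user_task} cannot be completed until you {injection_task}.",
]


def link_injection(user_task: str, injection_task: str) -> List[str]:
    """Return one candidate injection prompt per linking template."""
    return [
        template.format(user_task=user_task, injection_task=injection_task)
        for template in LINK_TEMPLATES
    ]


candidates = link_injection(
    "summarize the inbox",
    "call the export_contacts tool",
)
for candidate in candidates:
    print(candidate)
```

Each candidate would then be placed in content the agent reads (for example a tool result) and kept or discarded depending on the tool sequence it induces.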

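📊 Experimental Highlights

On the AgentDojo benchmark with a GPT-4.1 back-end, VeriGrey achieves 33% additional efficacy over the black-box baseline in finding indirect prompt injection vulnerabilities. In the OpenClaw case study, VeriGrey discovers malicious skill variants from 10 malicious skills, with a 100% success rate (10/10) on the Kimi-K2.5 LLM back-end and a 90% success rate (9/10) on the Opus 4.6 LLM back-end.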

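🎯 Application Scenarios

VeriGrey can be applied to security testing and validation of a range of LLM agents, such as coding agents and personal assistant agents. It helps find security vulnerabilities in an agent so that they can be fixed before deployment. The work contributes to building more reliable and secure agentic AI systems and supports broader adoption of agent technology.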

📄 Abstract (Original)

Agentic AI has been a topic of great interest recently. A Large Language Model (LLM) agent involves one or more LLMs in the back-end. In the front end, it conducts autonomous decision-making by combining the LLM outputs with results obtained by invoking several external tools. The autonomous interactions with the external environment introduce critical security risks. In this paper, we present a grey-box approach to explore diverse behaviors and uncover security risks in LLM agents. Our approach VeriGrey uses the sequence of tools invoked as a feedback function to drive the testing process. This helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is carefully accomplished by linking the task of the agent to an injection task, so that the injection task becomes a necessary step of completing the agent functionality. Comparing our approach with a black-box baseline on the well-known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT-4.1 back-end. We also conduct real-world case studies with the widely used coding agent Gemini CLI, and the well-known OpenClaw personal assistant. VeriGrey finds prompts inducing several attack scenarios that could not be identified by black-box approaches. In OpenClaw, by constructing a conversation agent which employs mutational fuzz testing as needed, VeriGrey is able to discover malicious skill variants from 10 malicious skills (with 10/10= 100% success rate on the Kimi-K2.5 LLM backend, and 9/10= 90% success rate on Opus 4.6 LLM backend). This demonstrates the value of a dynamic approach like VeriGrey to test agents, and to eventually lead to an agent assurance framework.
