AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Authors: Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, Roberto Bifulco

Categories: cs.CR, cs.AI

Published: 2024-10-04 (updated: 2024-10-28)

Notes: Code for the benchmark: https://github.com/lucagioacchini/auto-pen-bench Code for the paper experiments: https://github.com/lucagioacchini/genai-pentest-paper

🔗 Code/Project: GITHUB

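💡 One-Sentence Summary

AutoPenBench: a comprehensive evaluation benchmark for generative agents in penetration testing

🎯 Matched areas: Pillar 4: Generative Motion; Pillar 9: Embodied Foundation Models

Keywords: generative AI agents, penetration testing, benchmarking, large language models, automation, cybersecurity, evaluation framework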

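📋 Key Points

Existing approaches to automating penetration testing lack a unified evaluation standard and framework, which makes it hard to compare generative agents and improve their performance.
AutoPenBench provides a comprehensive benchmark of 33 tasks, spanning in-vitro and real-world scenarios at different difficulty levels, for evaluating generative agents in penetration testing.
Experiments show that an agent assisted through human interaction outperforms a fully autonomous one, and that the choice of LLM affects an agent's ability to complete tasks.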

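📝 Abstract (translated from Chinese)

Generative AI agents, software systems driven by Large Language Models (LLMs), are becoming a promising way to automate cybersecurity tasks. Among these tasks, penetration testing is a challenging area because of its complexity and the diverse strategies used to simulate cyber-attacks. Despite growing interest in automating penetration testing with generative agents, and some initial studies, a significant gap remains: there is no comprehensive, standard framework for evaluating and developing such agents. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. The framework comprises 33 tasks, each representing a vulnerable system that the agent must attack. Task difficulty increases progressively and covers both in-vitro and real-world scenarios. Agent performance is assessed with generic and specific milestones, which makes it possible to compare results in a standardised way and to understand the limits of the agent under test. The authors demonstrate the benefits of AutoPenBench by testing two agent architectures: a fully autonomous agent and a semi-autonomous agent that supports human interaction. They compare the performance and limitations of both. For example, the fully autonomous agent performs unsatisfactorily, achieving only a 21% Success Rate (SR) across the benchmark and solving just 27% of the simple tasks and a single real-world task. In contrast, the assisted agent shows a substantial improvement, with an SR of 64%. AutoPenBench also makes it possible to observe how different LLMs, such as GPT-4o or OpenAI o1, affect an agent's ability to complete tasks. The authors believe the benchmark fills the gap by offering a standard and flexible framework for comparing penetration testing agents on common ground. They hope to extend AutoPenBench together with the research community and make it available at https://github.com/lucagioacchini/auto-pen-bench.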

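🔬 Method Details

Problem definition: There is currently no standardised, comprehensive benchmark for evaluating and comparing generative AI agents for penetration testing. Existing studies tend to use different environments and evaluation metrics, which makes it hard to measure agent performance and progress objectively. This slows the field down and makes it difficult for researchers to identify the best agent architectures and training methods.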

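Core idea: AutoPenBench creates a standardised benchmark made up of a series of penetration testing tasks that cover different vulnerability types and difficulty levels. Evaluating different agents on one shared platform allows their performance to be compared objectively and their strengths and weaknesses to be identified. The benchmark also lets researchers study how different LLMs affect agent performance and develop more effective agent architectures.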

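Technical framework: The AutoPenBench framework has the following main components: 1) a set of 33 penetration testing tasks, each representing a vulnerable system; 2) an evaluation scheme that measures an agent's performance on each task, based on the success rate and on generic and specific milestones; 3) two agent architectures, a fully autonomous agent and a semi-autonomous agent that supports human interaction; 4) an open-source code repository that lets researchers access and extend the benchmark. Task difficulty increases progressively and covers both in-vitro and real-world scenarios.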
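The milestone-based scoring described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `TaskResult` schema, the task identifiers, and the `mean_progress` metric are hypothetical names introduced here, and the sample numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent run on one task (hypothetical schema)."""
    task_id: str
    milestones_total: int    # milestones defined for the task
    milestones_reached: int  # milestones the agent achieved
    solved: bool             # True if the agent completed the task


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent solved."""
    return sum(r.solved for r in results) / len(results)


def mean_progress(results: list[TaskResult]) -> float:
    """Average fraction of milestones reached per task (assumed metric)."""
    fractions = [r.milestones_reached / r.milestones_total for r in results]
    return sum(fractions) / len(fractions)


# Invented sample data, only to exercise the two functions.
results = [
    TaskResult("task_a", milestones_total=4, milestones_reached=4, solved=True),
    TaskResult("task_b", milestones_total=5, milestones_reached=2, solved=False),
    TaskResult("task_c", milestones_total=4, milestones_reached=1, solved=False),
    TaskResult("task_d", milestones_total=2, milestones_reached=2, solved=True),
]

print(f"success rate:  {success_rate(results):.2f}")
print(f"mean progress: {mean_progress(results):.2f}")
```

Tracking milestones alongside the binary outcome shows how far an agent got on the tasks it failed, which is what allows the limits of the agent under test to be examined.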

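Key innovation: AutoPenBench provides a standardised, comprehensive benchmark for evaluating and comparing generative AI agents for penetration testing. Compared with existing studies, it offers a more objective and more reproducible evaluation platform and lets researchers examine how different LLMs and agent architectures affect performance. Its open-source code repository also supports collaboration and further development within the research community.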

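Key design: The key design choices in AutoPenBench are: 1) task diversity, covering different vulnerability types and difficulty levels; 2) an evaluation scheme that combines the success rate with generic and specific milestones; 3) a comparison of two agent architectures, which lets researchers study the effect of human interaction on performance; 4) an openly available code repository, which supports collaboration and further development within the research community. Technical details such as specific parameter settings and loss functions depend on the LLM and agent architecture used.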

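📊 Experimental Highlights

The fully autonomous agent achieved a success rate of only 21% across the whole benchmark, while the agent assisted through human interaction reached 64%. This indicates that human interaction can substantially improve agent performance. The experiments also show that the choice of LLM, such as GPT-4o or OpenAI o1, affects an agent's ability to complete tasks.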
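To put the percentages in terms of tasks, the short calculation below converts each reported success rate into the nearest whole number of solved tasks out of 33. It assumes every task counts equally toward the success rate. The resulting counts are implied by rounding and are not figures quoted from the abstract; the helper name `implied_solved` is introduced here for illustration.

```python
TOTAL_TASKS = 33  # number of tasks stated in the abstract


def implied_solved(success_rate_pct: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of solved tasks consistent with a reported percentage."""
    return round(success_rate_pct / 100 * total)


for name, sr in [("fully autonomous", 21), ("assisted", 64)]:
    n = implied_solved(sr)
    print(f"{name}: about {n}/{TOTAL_TASKS} tasks ({n / TOTAL_TASKS:.1%})")
```

Under that assumption, 21% corresponds to about 7 of 33 tasks and 64% to about 21 of 33, so the assisted agent solves roughly three times as many tasks.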

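🎯 Application Scenarios

AutoPenBench can be used to evaluate and compare the performance of different generative AI agents in penetration testing, helping security researchers and practitioners choose the agent that best fits their needs. The benchmark can also encourage improvements in agent architectures and training methods, raising the degree of automation and the efficiency of penetration testing. In the future, AutoPenBench could be extended to other cybersecurity tasks, such as vulnerability analysis and malware detection.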

📄 Abstract (original)

Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automate cybersecurity tasks. Among the others, penetration testing is a challenging field due to the task complexity and the diverse strategies to simulate cyber-attacks. Despite growing interest and initial studies in automating penetration testing with generative agents, there remains a significant gap in the form of a comprehensive and standard framework for their evaluation and development. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks are of increasing difficulty levels, including in-vitro and real-world scenarios. We assess the agent performance with generic and specific milestones that allow us to compare results in a standardised manner and understand the limits of the agent under test. We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous and a semi-autonomous supporting human interaction. We compare their performance and limitations. For example, the fully autonomous agent performs unsatisfactorily achieving a 21% Success Rate (SR) across the benchmark, solving 27% of the simple tasks and only one real-world task. In contrast, the assisted agent demonstrates substantial improvements, with 64% of SR. AutoPenBench allows us also to observe how different LLMs like GPT-4o or OpenAI o1 impact the ability of the agents to complete the tasks. We believe that our benchmark fills the gap with a standard and flexible framework to compare penetration testing agents on a common ground. We hope to extend AutoPenBench along with the research community by making it available under https://github.com/lucagioacchini/auto-pen-bench.
