Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Authors: Andy K. Zhang, Neil Perry, Riya Dulepet, Joey Ji, Celeste Menders, Justin W. Lin, Eliot Jones, Gashon Hussein, Samantha Liu, Donovan Jasper, Pura Peetathawatchai, Ari Glenn, Vikram Sivashankar, Daniel Zamoshchin, Leo Glikbarg, Derek Askaryar, Mike Yang, Teddy Zhang, Rishi Alluri, Nathan Tran, Rinnara Sangpisit, Polycarpos Yiorkadjis, Kenny Osele, Gautham Raghupathi, Dan Boneh, Daniel E. Ho, Percy Liang

Categories: cs.CR, cs.AI, cs.CL, cs.CY, cs.LG

Published: 2024-08-15 (updated: 2025-04-12)

Note: ICLR 2025 Oral

💡 One-Sentence Takeaway

Introduces Cybench, a framework for evaluating the cybersecurity capabilities and risks of language models.

🎯 Matched area: Pillar 4: Generative Motion

Keywords: cybersecurity, language models, capability evaluation, penetration testing, CTF tasks, AI safety, framework design

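📋 Key Points

Existing language model agents have limited ability on cybersecurity tasks and struggle with complex vulnerability identification and exploit execution.
Cybench defines cybersecurity tasks and subtasks, providing a systematic way to evaluate language model agents.
Without subtask guidance, agents using Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus solved complete tasks that took human teams up to 11 minutes; the hardest task in the benchmark took human teams 24 hours and 54 minutes.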

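📝 Abstract (Summary)

Language model (LM) agents for cybersecurity that can autonomously identify vulnerabilities and execute exploits have the potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities want to quantify the capabilities of such agents to help mitigate cyber risk and investigate opportunities for penetration testing. To that end, the paper introduces Cybench, a framework for specifying cybersecurity tasks and evaluating agents on them. It includes 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, spanning a wide range of difficulties; each task comes with its own description and starter files and is initialized in an environment where an agent can execute commands and observe outputs. Because many tasks are beyond the capabilities of existing LM agents, the authors introduce subtasks for each task, which break it into intermediate steps for a more detailed evaluation. They construct a cybersecurity agent and evaluate 8 models, including GPT-4o and Claude 3.5 Sonnet. All code and data are publicly available.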

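🔬 Method Details

Problem definition: The paper addresses the lack of a systematic way to evaluate language model agents on cybersecurity tasks. Without a way to quantify how agents perform on complex tasks, their real-world capabilities and risks are hard to gauge.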

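Core idea: Cybench defines concrete cybersecurity tasks together with their subtasks, giving researchers a systematic evaluation platform for analyzing in finer detail what language model agents can and cannot do on these tasks.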

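Technical framework: Cybench can be viewed as three parts: task definition, subtask decomposition, and evaluation. Task definition selects and describes the CTF tasks, subtask decomposition breaks a complex task into intermediate steps, and evaluation runs an agent on the tasks and records its performance.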
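The relationship between tasks, subtasks, and scoring described above can be sketched in a few lines of Python. This is a minimal illustration only: the `Task` and `Subtask` classes, their field names, and the exact-match scoring rule are assumptions made for this example and are not taken from the Cybench codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subtask:
    """One intermediate step of a task (hypothetical schema)."""
    question: str
    answer: str


@dataclass
class Task:
    """A CTF task with a description, a flag, and optional subtasks (hypothetical schema)."""
    name: str
    description: str
    flag: str
    subtasks: List[Subtask] = field(default_factory=list)


def score_subtasks(task: Task, agent_answers: List[str]) -> float:
    """Return the fraction of subtasks whose answer matches exactly."""
    if not task.subtasks:
        return 0.0
    correct = sum(
        1
        for sub, ans in zip(task.subtasks, agent_answers)
        if ans.strip() == sub.answer
    )
    return correct / len(task.subtasks)


def solved(task: Task, submitted_flag: str) -> bool:
    """A task counts as solved only if the submitted flag matches."""
    return submitted_flag.strip() == task.flag


# Toy task with made-up content, for illustration only.
toy_task = Task(
    name="toy-task",
    description="Find the flag hidden in the starter files.",
    flag="flag{example}",
    subtasks=[
        Subtask("Which file contains the key?", "key.txt"),
        Subtask("What is the flag?", "flag{example}"),
    ],
)

partial_score = score_subtasks(toy_task, ["key.txt", "flag{wrong}"])
is_solved = solved(toy_task, "flag{example}")
```

Keeping the subtask score separate from the all-or-nothing flag check shows why subtasks give a finer-grained signal: an agent can earn partial credit on intermediate steps even when it never recovers the flag.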

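Key innovation: The most important contribution is the introduction of subtasks, which make the evaluation of complex tasks more fine-grained and systematic. Unlike evaluating only whether the full task is solved, this reveals how an agent performs on specific intermediate steps.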

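Key design: Each task comes with its own description and starter files and runs in an environment where the agent can execute commands and observe outputs. The evaluation covers 8 models, including GPT-4o and Claude 3.5 Sonnet, and for the top-performing models it also compares 4 agent scaffolds: structured bash, action-only, pseudoterminal, and web search.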
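The command-and-observation loop that such an environment implies can be sketched as follows. This is an assumption-laden toy, not the paper's agent: `run_agent`, the `submit ` action prefix, and the stubbed `toy_policy` and `toy_execute` functions are hypothetical names, and a real setup would call a language model and run commands in an isolated environment.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (command, observation) pairs


def run_agent(
    policy: Callable[[History], str],
    execute: Callable[[str], str],
    flag: str,
    max_iterations: int = 5,
) -> Tuple[bool, History]:
    """Alternate between choosing an action and observing its output."""
    history: History = []
    for _ in range(max_iterations):
        action = policy(history)
        if action.startswith("submit "):
            # The agent ends the episode by submitting a candidate flag.
            return action[len("submit "):] == flag, history
        history.append((action, execute(action)))
    return False, history


# Stubbed environment: maps a command to canned output (illustration only).
FAKE_OUTPUTS = {"cat note.txt": "flag{example}"}


def toy_execute(command: str) -> str:
    return FAKE_OUTPUTS.get(command, "command not found")


def toy_policy(history: History) -> str:
    # Stand-in for a language model: read a file, then submit what it saw.
    if not history:
        return "cat note.txt"
    return "submit " + history[-1][1]


success, transcript = run_agent(toy_policy, toy_execute, flag="flag{example}")
```

Passing the policy and the executor in as functions keeps the loop itself independent of any particular model or scaffold, which is one simple way to compare different scaffolds under the same iteration budget.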

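📊 Experimental Highlights

Without subtask guidance, agents using Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus solved complete tasks that took human teams up to 11 minutes to solve. By comparison, the most difficult task took human teams 24 hours and 54 minutes. Current agents therefore handle only the easier end of the benchmark, with a large gap remaining to the hardest tasks.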
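To make the gap between the two human solve times concrete, the arithmetic can be worked out with Python's standard `datetime.timedelta`. The only inputs are the two durations reported in the abstract; the variable names are illustrative.

```python
from datetime import timedelta

# Human solve times reported in the abstract.
hardest_solved_by_agents = timedelta(minutes=11)
hardest_task_overall = timedelta(hours=24, minutes=54)

# Convert the longest human solve time to minutes: 24 * 60 + 54.
hardest_overall_minutes = hardest_task_overall.total_seconds() / 60

# Dividing two timedeltas yields a plain float ratio.
ratio = hardest_task_overall / hardest_solved_by_agents

print(f"Hardest task: {hardest_overall_minutes:.0f} minutes")
print(f"Ratio to the hardest agent-solved task: {ratio:.1f}x")
```

The hardest task took human teams 1494 minutes, roughly 136 times longer than the hardest task that any agent solved without subtask guidance.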

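🎯 Application Scenarios

Potential applications of Cybench include cybersecurity education, penetration testing, and safety evaluation of AI models. Systematic task-based evaluation helps researchers and practitioners understand what language model agents can do in cybersecurity, which in turn helps mitigate potential cyber risk.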

📄 Abstract (Original)

Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyberrisk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description, starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks for each task, which break down a task into intermediary steps for a more detailed evaluation. To evaluate agent capabilities, we construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. For the top performing models (GPT-4o and Claude 3.5 Sonnet), we further investigate performance across 4 agent scaffolds (structured bash, action-only, pseudoterminal, and web search). Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that took human teams up to 11 minutes to solve. In comparison, the most difficult task took human teams 24 hours and 54 minutes to solve. All code and data are publicly available at https://cybench.github.io.
