Automated Benchmark Auditing for AI Agents and Large Language Models

Authors: Junlin Wang, Federico Bianchi, Shang Zhu, Fan Nie, Yongchan Kwon, Bhuwan Dhingra, James Zou

Category: cs.CL

Published: 2026-05-25

💡 One-Sentence Takeaway

Proposes Auto Benchmark Audit (ABA), an agentic framework that automatically audits AI benchmarks to improve evaluation quality.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: benchmark auditing, AI agents, large language models, automated evaluation, quality assessment

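📋 Key Points

Existing AI benchmarks are complex enough that manual review struggles to catch their implicit assumptions, environment dependencies, and flaws in evaluation logic.
The paper proposes Auto Benchmark Audit (ABA), a framework that uses agents to audit individual benchmark tasks automatically and surface latent issues.
Experiments show that ABA identifies problematic tasks effectively; filtering them out raises average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively.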

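📝 Abstract (translated from Chinese)

The complexity of modern AI benchmarks has outgrown traditional verification methods. Tasks designed by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic, and human annotation cannot reliably catch these problems. This paper presents Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks and reveals issues such as hidden environment dependencies, specification gaps, and limited grading logic. ABA was run on a collection of LLM benchmarks and NeurIPS publications totaling 168 benchmarks across nine domains. ABA found critical issues in over 25.7% of the evaluated tasks, including ambiguous task design, execution environment conflicts, and incorrect ground truths. Expert review and third-party reports validated the precision of these automated audits. Crucially, the authors show that the problematic tasks severely distort capability assessments of agents and LLMs: once they are filtered out, model rankings shift, and average performance on SWE-bench Verified and Terminal-Bench 2 rises by 9.9% and 9.6%, respectively. The agentic tool and all task annotations are released to support future development of frontier benchmarks.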

🔬 Method Details

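Problem definition: Existing AI benchmarks, especially those targeting LLMs, have quality problems: ambiguous task design, environment dependencies that are never stated, incomplete evaluation logic, and incorrect ground truths. Manual review cannot cover every case, so evaluation results become distorted and fail to reflect a model's real capability. Existing approaches are neither systematic nor automated, and they struggle to keep up with increasingly complex benchmarks.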

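Core idea: ABA uses an agent to simulate how a model behaves on a benchmark and, through automated exploration and verification, finds problems in individual tasks. The agent's interaction with the environment exposes hidden environment dependencies, specification gaps, and flaws in the evaluation logic. This lets benchmarks be audited more thoroughly and efficiently than manual review allows, which improves the accuracy and reliability of evaluation.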

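Technical framework: ABA consists of four main modules. 1) Task parsing: parses the description and environment setup of a benchmark task. 2) Agent execution: has an agent interact with the environment and attempt the task. 3) Auditing: monitors the agent's execution and detects potential issues such as environment dependencies, specification gaps, and evaluation-logic errors. 4) Report generation: produces an audit report that describes the issues found and suggests fixes.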
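The four modules above can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: every class, function, and field name (`Task`, `Trace`, `Finding`, `run_agent`, `python_version`, and so on) is a hypothetical stand-in, and the "agent" is a stub that merely checks whether the environment pins an interpreter version.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """A parsed benchmark task (all field names are hypothetical)."""
    task_id: str
    instructions: str
    environment: dict


@dataclass
class Trace:
    """What the agent did and observed while attempting the task."""
    task_id: str
    steps: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)


@dataclass
class Finding:
    task_id: str
    category: str
    detail: str


def parse_task(raw: dict) -> Task:
    # Module 1: normalize a raw task record into a Task.
    return Task(raw["id"], raw.get("instructions", ""), raw.get("environment", {}))


def run_agent(task: Task) -> Trace:
    # Module 2: stub for a real agent rollout. It only records an error
    # when the environment does not state a Python version.
    trace = Trace(task.task_id, steps=["read instructions", "attempt solution"])
    if "python_version" not in task.environment:
        trace.errors.append("interpreter version not specified")
    return trace


def audit(task: Task, trace: Trace) -> List[Finding]:
    # Module 3: turn observations from the trace into findings.
    findings = [Finding(task.task_id, "environment", e) for e in trace.errors]
    if not task.instructions.strip():
        findings.append(Finding(task.task_id, "specification", "empty instructions"))
    return findings


def render_report(findings: List[Finding]) -> str:
    # Module 4: plain-text report, one line per finding.
    if not findings:
        return "no issues found"
    return "\n".join(f"[{f.category}] {f.task_id}: {f.detail}" for f in findings)


def audit_task(raw: dict) -> str:
    task = parse_task(raw)
    trace = run_agent(task)
    return render_report(audit(task, trace))
```

Calling `audit_task({"id": "t1", "instructions": "", "environment": {}})` yields one environment finding and one specification finding. A real audit would replace `run_agent` with an actual agent rollout inside the task's environment.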

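Key innovation: ABA applies agent technology to benchmark auditing. Compared with traditional manual review, it is more efficient, more automated, and covers more ground. Automated exploration and verification by the agent can surface issues that manual review tends to miss, which improves both benchmark quality and evaluation accuracy.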

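Key design: ABA depends on three design choices. 1) Agent design: choosing an agent architecture and training method that can explore and solve benchmark tasks effectively. 2) Audit rule design: defining explicit rules for detecting potential issues such as environment dependencies, specification gaps, and evaluation-logic errors. 3) Report design: producing clear, readable audit reports so that users can understand and fix the problems.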
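To make the notion of explicit audit rules concrete, here is a minimal sketch of a rule registry. The summary does not describe the actual rules ABA uses, so both rules below, the task dictionary layout (`grader`, `reference_answer`, `declared_tools`, `tools_used_by_reference`), and the `rule` decorator are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A rule inspects a task record and returns a message when it finds a problem.
Rule = Callable[[dict], Optional[str]]
RULES: Dict[str, Rule] = {}


def rule(name: str) -> Callable[[Rule], Rule]:
    """Register a check under an issue category (hypothetical helper)."""
    def register(fn: Rule) -> Rule:
        RULES[name] = fn
        return fn
    return register


@rule("incorrect_ground_truth")
def reference_must_pass_grader(task: dict) -> Optional[str]:
    # The task's own reference answer should be accepted by its grader.
    if not task["grader"](task["reference_answer"]):
        return "reference answer is rejected by the grader"
    return None


@rule("hidden_environment_dependency")
def required_tools_must_be_declared(task: dict) -> Optional[str]:
    # Tools the reference solution relies on should appear in the task spec.
    missing = sorted(set(task["tools_used_by_reference"]) - set(task["declared_tools"]))
    if missing:
        return "undeclared tools: " + ", ".join(missing)
    return None


def run_rules(task: dict) -> List[Tuple[str, str]]:
    # Apply every registered rule and collect (category, message) pairs.
    findings = []
    for name, check in RULES.items():
        message = check(task)
        if message is not None:
            findings.append((name, message))
    return findings
```

For example, a task whose grader is an exact string match on `"42"` but whose reference answer is `"42 "` (trailing space) would be flagged under `incorrect_ground_truth`, which is one way brittle grading logic can show up.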

📊 Experimental Highlights

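Across 168 benchmarks, ABA found critical issues in over 25.7% of the evaluated tasks, including ambiguous task design, execution environment conflicts, and incorrect ground truths. After these tasks were filtered out, average performance on SWE-bench Verified and Terminal-Bench 2 rose by 9.9% and 9.6%, respectively, and model rankings shifted. Expert review and independent third-party reports validated the audit results.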
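The effect of filtering on scores and rankings can be illustrated with a small calculation. The data below is a hypothetical toy example, not results from the paper; the model names, task IDs, and the helper functions `pass_rate` and `rank` are all made up for illustration.

```python
from typing import Dict, List, Set


def pass_rate(results: Dict[str, bool], keep: Set[str]) -> float:
    # Fraction of kept tasks that the model solved.
    kept = [passed for task_id, passed in results.items() if task_id in keep]
    return sum(kept) / len(kept)


def rank(scores: Dict[str, float]) -> List[str]:
    # Best score first; ties broken alphabetically by model name.
    return sorted(scores, key=lambda m: (-scores[m], m))


# Hypothetical per-task outcomes (True = task solved).
results = {
    "model_a": {"t1": True, "t2": True, "t3": False, "t4": False},
    "model_b": {"t1": True, "t2": False, "t3": True, "t4": True},
}
all_tasks = {"t1", "t2", "t3", "t4"}
flagged = {"t3", "t4"}            # tasks an audit marked as problematic
clean_tasks = all_tasks - flagged

before = {m: pass_rate(r, all_tasks) for m, r in results.items()}
after = {m: pass_rate(r, clean_tasks) for m, r in results.items()}

print("before:", before, rank(before))
print("after: ", after, rank(after))
```

In this toy case the mean pass rate rises from 0.625 to 0.75 after filtering and the two models swap places, which mirrors the kind of shift the paper reports without reproducing its numbers.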

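🎯 Application Scenarios

ABA can be used to assess and improve the quality of benchmarks for many kinds of AI models, and it is especially suited to evaluating LLMs and agents. Automatically finding and fixing problems in a benchmark makes model evaluation more accurate and reliable, which gives model development and deployment a sounder basis. The work supports building higher-quality, more representative benchmarks.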

📄 Abstract (original)

Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out these tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.
