ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Authors: Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz

Categories: cs.AI, cs.DB

Published: 2026-03-31

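💡 One-Sentence Summary

ELT-Bench-Verified shows that benchmark quality issues cause AI agent capabilities to be underestimated, and provides a corrected benchmark.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: ELT pipeline, AI agent, benchmarking, quality auditing, data engineering, large language models, Auditor-Corrector, ground truth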

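📋 Key Points

On the existing ELT-Bench benchmark, AI agents score poorly at building ELT pipelines, which may understate their real capabilities.
The paper proposes an Auditor-Corrector methodology that combines LLM-driven root-cause analysis with human validation to audit and correct benchmark quality issues.
It builds ELT-Bench-Verified, whose corrected evaluation logic and ground truth significantly raise measured agent performance on ELT pipeline construction.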

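📝 Abstract (Translated)

Building Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and an important target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting that they lacked practical utility. The authors revisit these results and identify two factors that caused agent capabilities to be severely underestimated. First, re-evaluating ELT-Bench with upgraded large language models shows that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, they develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applied to ELT-Bench, it reveals that most failed transformation tasks contain benchmark-attributable errors, including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth, all of which penalize correct agent outputs. Based on these findings, the authors build ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields a significant improvement that is attributable entirely to benchmark correction. The results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, the findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting that quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. ELT-Bench-Verified is released to provide a more reliable foundation for progress in AI-driven data engineering automation.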

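🔬 Method Details

Problem definition: The paper addresses quality problems in the existing ELT-Bench benchmark that cause AI agent performance on ELT pipeline construction to be underestimated. The benchmark suffers from rigid evaluation scripts, ambiguous specifications, and incorrect ground truth, so an agent can produce a correct result and still be marked as failing.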
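To make this failure mode concrete, here is a minimal sketch of how a rigid check can reject a correct output. It is an illustration only, not ELT-Bench's actual evaluation script: the helper names (`rigid_match`, `tolerant_match`, `normalize_row`), the row format (a list of dicts), and the specific relaxations (row order, column-name case, surrounding whitespace, float rounding) are all assumptions made for the example.

```python
from collections import Counter


def rigid_match(actual, expected):
    """Hypothetical rigid check: rows must be identical, in the same order."""
    return actual == expected


def normalize_row(row, ndigits=6):
    """Map a row (dict of column -> value) to a canonical, hashable tuple.

    Assumed relaxations: column names are case-insensitive, strings are
    stripped of surrounding whitespace, floats are rounded to `ndigits`.
    """
    items = []
    for col, val in row.items():
        if isinstance(val, float):
            val = round(val, ndigits)
        elif isinstance(val, str):
            val = val.strip()
        items.append((col.lower(), val))
    # Sort by column name so that column order does not matter.
    return tuple(sorted(items, key=lambda kv: kv[0]))


def tolerant_match(actual, expected, ndigits=6):
    """Compare two tables as multisets of normalized rows (order-insensitive)."""
    actual_rows = Counter(normalize_row(r, ndigits) for r in actual)
    expected_rows = Counter(normalize_row(r, ndigits) for r in expected)
    return actual_rows == expected_rows


# Made-up ground truth and agent output that carry the same information.
expected = [
    {"id": 1, "region": "EU", "total": 10.0},
    {"id": 2, "region": "US", "total": 3.3333333},
]
actual = [
    {"ID": 2, "Region": "US", "total": 3.33333334},
    {"ID": 1, "Region": "EU ", "total": 10},
]

print("rigid:", rigid_match(actual, expected))
print("tolerant:", tolerant_match(actual, expected))
```

Rounding is a simple but imperfect tolerance: two nearly equal values that straddle a rounding boundary would still be treated as different, so a real evaluator might instead compare sorted rows with `math.isclose`.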

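Core idea: Systematically audit and correct the benchmark to obtain a more reliable evaluation of ELT pipeline construction. Identifying and fixing errors in the benchmark makes it possible to measure what AI agents can actually do on this task.

Technical framework: The approach has four main stages: (1) re-evaluate ELT-Bench with upgraded large language models; (2) develop the Auditor-Corrector methodology, which uses an LLM for root-cause analysis and adds human validation (Fleiss' kappa = 0.85) to audit benchmark quality; (3) build ELT-Bench-Verified from the audit results, correcting the evaluation logic and ground truth; (4) re-evaluate AI agents on ELT-Bench-Verified.

Key innovation: The main contribution is the Auditor-Corrector methodology, a benchmark quality audit that pairs an LLM with human validation. It identifies and corrects errors in the benchmark, which improves the benchmark's reliability and accuracy. Compared with conventional benchmarking practice, it places more emphasis on assuring the quality of the benchmark itself.

Key design: In the Auditor-Corrector methodology, the LLM automatically analyzes why an agent failed and proposes candidate error types. In the human validation stage, expert annotators review the LLM's analysis and correct the errors found in the benchmark. The paper uses Fleiss' kappa to measure agreement among the annotators, which supports the reliability of the corrections. In addition, ELT-Bench-Verified revises the evaluation scripts so that they are more flexible and assess agent outputs more accurately.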
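For readers unfamiliar with the agreement statistic, the sketch below computes Fleiss' kappa with the standard formula, kappa = (P_bar - P_e) / (1 - P_e), where P_bar is the mean per-item agreement and P_e is the agreement expected by chance. The annotation data, the label names, and the function name `fleiss_kappa` are made up for illustration; the summary reports only the final value of 0.85, not the underlying annotations.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one list of labels per item, one label per rater.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # P_i: fraction of rater pairs that agree on item i.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(per_item) / n_items

    # P_e: chance agreement from the overall share of each category.
    total = n_items * n_raters
    p_e = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined in this case.
        # Returning 1.0 is an assumption made for this sketch.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


# Made-up example: three annotators label three failed tasks as caused by
# the benchmark ("B") or by the agent ("A").
example = [
    ["B", "B", "B"],
    ["A", "A", "A"],
    ["B", "B", "A"],
]
print(round(fleiss_kappa(example), 2))  # 0.55
```

In this toy example the annotators disagree on one of three items, giving a kappa of 0.55; the reported 0.85 corresponds to considerably stronger agreement.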

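📊 Experimental Highlights

With upgraded large language models and the corrected ELT-Bench-Verified benchmark, measured AI agent performance on ELT pipeline construction improves significantly. The results show that benchmark correction contributes substantially to the gain, which underlines how much benchmark quality matters when evaluating agents. The inter-annotator agreement (Fleiss' kappa = 0.85) supports the reliability of the human validation.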

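🎯 Application Scenarios

The work applies to data engineering automation, where it gives a more reliable benchmark for evaluating AI agents that build ELT pipelines. A higher-quality benchmark measures agent capabilities more accurately, which can speed up the adoption of AI in data engineering and reduce its cost and complexity. The study also serves as a reference for benchmarking other complex agentic tasks and highlights the importance of auditing benchmark quality.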

📄 Abstract (Original)

Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.
