AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

Authors: Hailin Zhong, Shengxin Zhu

Categories: cs.SE, cs.AI

Published: 2026-05-13

Comments: 16 pages

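💡 One-Sentence Summary

Proposes AI Harness Engineering, a runtime substrate intended to make foundation-model agents more reliable in software engineering.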

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: software engineering, autonomous agents, foundation models, runtime systems, AI Harness Engineering

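📋 Key Points

Autonomous software-engineering agents remain unreliable in realistic development settings, which is usually attributed to insufficient foundation-model capability.
The paper proposes AI Harness Engineering: a runtime substrate, the harness, mediates the interaction between model and environment in order to improve reliability.
A four-level harness ladder (H0-H3) and a trace-based evaluation protocol are used to show how the harness level shapes agent behavior and the evidence each run leaves behind.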

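📝 Abstract (Translated from Chinese)

Foundation models have transformed automated code generation, yet autonomous software-engineering agents remain unreliable in realistic development environments. The prevailing view attributes this to insufficient model capability. This paper takes a different view: software-engineering capability arises from a model-harness-environment system, in which a runtime substrate (the harness) mediates how a foundation-model agent observes a project, acts on it, receives feedback, and determines that a change is complete. The paper formalizes this substrate as AI Harness Engineering and identifies eleven component responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy auditing, and intervention recording. The harness is operationalized through a four-level ladder (H0-H3) that progressively exposes runtime support to the agent, together with a trace-based evaluation protocol that converts each agent run into an auditable episode package. Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, while higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports. The framework reframes the central question of autonomous software engineering from whether a foundation model can produce a patch to whether the model-harness-environment system can produce a verifiable, attributable, and maintainable change. The paper outlines a research program for the runtime systems that foundation-model software agents will require.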

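🔬 Method Details

Problem definition: Existing autonomous software-engineering agents are unreliable in realistic development settings. The main pain point is the lack of effective control over, and interpretability of, model behavior, which makes generated code hard to debug and maintain. Existing approaches typically focus on improving the model itself and overlook how the runtime environment shapes agent behavior.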

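Core idea: Treat the software-engineering agent as a single model-harness-environment system, and introduce a runtime substrate, the harness, that explicitly manages and controls the agent's interaction with its environment. The harness handles key steps such as task specification, context selection, and tool access, which is intended to make the agent more reliable and controllable.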
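The mediation described above can be illustrated with a toy loop. This is a minimal sketch, not the paper's implementation: the `Harness` class, its `observe`/`act`/`feedback` methods, the `run_episode` driver, and the stub agent are all hypothetical names, and the "project" is just a dictionary of file names to text. The point it shows is that the harness, rather than the agent, owns the completion check.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical harness mediating a toy project (a dict of file -> text)."""
    project: dict
    check: Callable[[dict], bool]  # completion criterion owned by the harness
    trace: list = field(default_factory=list)

    def observe(self) -> dict:
        # The agent receives a copy, never the live project state.
        return dict(self.project)

    def act(self, path: str, new_text: str) -> None:
        # Every edit goes through the harness and is recorded in the trace.
        self.project[path] = new_text
        self.trace.append(("edit", path))

    def feedback(self) -> bool:
        # The harness, not the agent, decides whether the change is complete.
        ok = self.check(self.project)
        self.trace.append(("check", ok))
        return ok


def run_episode(harness: Harness, agent, max_steps: int = 5) -> bool:
    """Observe -> act -> feedback until the check passes or steps run out."""
    for _ in range(max_steps):
        view = harness.observe()
        path, new_text = agent(view)
        harness.act(path, new_text)
        if harness.feedback():
            return True
    return False


def fix_typo_agent(view: dict):
    """Stub standing in for a foundation-model agent (assumption)."""
    return "greeting.txt", view["greeting.txt"].replace("helo", "hello")


harness = Harness(
    project={"greeting.txt": "helo world"},
    check=lambda p: "hello" in p["greeting.txt"],
)
done = run_episode(harness, fix_typo_agent)
```

Because the agent only ever sees what `observe` returns and only changes the project through `act`, the recorded trace is a complete account of the run, which is the property a trace-based evaluation would rely on.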

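Technical framework: AI Harness Engineering assigns the harness eleven component responsibilities. The abstract names them without defining them, so the one-line descriptions below are this summary's reading:
1) Task specification: states precisely what the agent must accomplish.
2) Context selection: supplies the agent with relevant project information and code fragments.
3) Tool access: lets the agent use the software-engineering tools it needs.
4) Project memory: maintains a record of the agent's past operations in the project.
5) Task state: tracks the progress of task execution.
6) Observability: provides interfaces for monitoring and debugging agent behavior.
7) Failure attribution: analyzes why the agent failed.
8) Verification: checks whether the agent's code meets the requirements.
9) Permissions: controls the agent's access to the project.
10) Entropy auditing: monitors the randomness of agent behavior.
11) Intervention recording: records events in which a human intervenes in the agent's behavior.
The framework also defines a four-level harness ladder (H0-H3) that progressively increases the runtime support exposed to the agent.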
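The responsibilities and the ladder above can be written down as two enumerations. The following is a sketch under stated assumptions: the eleven names and the four level labels come from the abstract, but the assignment of responsibilities to levels in `_ADDED_AT` is purely hypothetical, since the summary does not say which level exposes which support.

```python
from enum import Enum, IntEnum


class Responsibility(Enum):
    """The eleven component responsibilities named in the abstract."""
    TASK_SPECIFICATION = "task specification"
    CONTEXT_SELECTION = "context selection"
    TOOL_ACCESS = "tool access"
    PROJECT_MEMORY = "project memory"
    TASK_STATE = "task state"
    OBSERVABILITY = "observability"
    FAILURE_ATTRIBUTION = "failure attribution"
    VERIFICATION = "verification"
    PERMISSIONS = "permissions"
    ENTROPY_AUDITING = "entropy auditing"
    INTERVENTION_RECORDING = "intervention recording"


class HarnessLevel(IntEnum):
    """The four-level ladder; higher levels expose more runtime support."""
    H0 = 0
    H1 = 1
    H2 = 2
    H3 = 3


R = Responsibility

# HYPOTHETICAL mapping, for illustration only: what each level adds
# on top of the levels below it. The source does not specify this.
_ADDED_AT = {
    HarnessLevel.H0: {R.TASK_SPECIFICATION},
    HarnessLevel.H1: {R.CONTEXT_SELECTION, R.TOOL_ACCESS, R.PERMISSIONS},
    HarnessLevel.H2: {R.PROJECT_MEMORY, R.TASK_STATE, R.OBSERVABILITY,
                      R.FAILURE_ATTRIBUTION},
    HarnessLevel.H3: {R.VERIFICATION, R.ENTROPY_AUDITING,
                      R.INTERVENTION_RECORDING},
}


def exposed(level: HarnessLevel) -> frozenset:
    """Responsibilities available at `level`, accumulated from H0 upward."""
    out = set()
    for lvl in HarnessLevel:
        if lvl <= level:
            out |= _ADDED_AT[lvl]
    return frozenset(out)
```

Making the ladder cumulative means that any two levels differ only by the support added between them, which is what allows a difference in outcomes to be attributed to the harness rather than to the model.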

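Key innovation: The main contribution is to treat the software-engineering agent as a whole system and to introduce the harness as a runtime substrate that explicitly manages and controls the agent's interaction with its environment. Unlike existing approaches, the paper does not focus only on the capability of the model; it asks how the runtime environment can make the agent more reliable and controllable.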

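Key design: The four-level harness ladder (H0-H3) progressively increases the runtime support available to the agent, so the effect of the harness on agent behavior and outcomes can be evaluated systematically. The trace-based evaluation protocol is the second key design: it converts each agent run into an auditable episode package, which makes the agent's behavior easier to analyze and debug.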
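One way to picture the conversion from a run trace to an episode package is sketched below. The five evidence types are the ones listed in the abstract; everything else (the `TraceEvent` and `EpisodePackage` classes, the event `kind` strings, and the `build_episode_package` function) is a hypothetical illustration rather than the paper's schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TraceEvent:
    kind: str      # hypothetical event label, e.g. "patch"
    payload: str   # raw text captured by the harness


@dataclass
class EpisodePackage:
    run_id: str
    harness_level: str
    final_patch: Optional[str] = None
    reproduction_logs: list = field(default_factory=list)
    failure_attributions: list = field(default_factory=list)
    requirement_checks: list = field(default_factory=list)
    verification_report: Optional[str] = None

    def evidence_kinds(self) -> list:
        """Names of the evidence fields that are actually populated."""
        skip = {"run_id", "harness_level"}
        return sorted(k for k, v in asdict(self).items()
                      if k not in skip and v)


def build_episode_package(run_id, harness_level, trace):
    """Fold a list of TraceEvent objects into one auditable package."""
    pkg = EpisodePackage(run_id=run_id, harness_level=harness_level)
    for event in trace:
        if event.kind == "patch":
            pkg.final_patch = event.payload  # the last patch wins
        elif event.kind == "reproduction_log":
            pkg.reproduction_logs.append(event.payload)
        elif event.kind == "failure_attribution":
            pkg.failure_attributions.append(event.payload)
        elif event.kind == "requirement_check":
            pkg.requirement_checks.append(event.payload)
        elif event.kind == "verification_report":
            pkg.verification_report = event.payload
        # Unknown event kinds are ignored in this sketch.
    return pkg


trace = [
    TraceEvent("patch", "diff A"),
    TraceEvent("reproduction_log", "test_login fails before the fix"),
    TraceEvent("patch", "diff B"),
]
package = build_episode_package("run-1", "H2", trace)
serialized = json.dumps(asdict(package), indent=2)
```

Calling `package.evidence_kinds()` on packages from different harness levels gives a direct way to compare what evidence each run left behind, independent of whether the patch itself was correct.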

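📊 Experimental Highlights

On a controlled validation task, the evidence structure of the episode packages changes with harness level. Lower levels produce only a final patch, whereas higher levels also produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports. This suggests that the harness can make agent runs more interpretable and controllable, which in turn supports reliability.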

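🎯 Application Scenarios

The work applies to automated software development, code repair, and code refactoring. AI Harness Engineering aims to make software-engineering agents more reliable and controllable, which could lower development cost and improve software quality. Longer term, it may raise the level of automation in software engineering and give developers more capable assistive tools.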

📄 Abstract (Original)

Foundation models have transformed automated code generation, yet autonomous software-engineering agents remain unreliable in realistic development settings. The dominant explanation locates this gap in model capability. We propose a different locus: software-engineering capability emerges from a model-harness-environment system, in which a runtime substrate -- the harness -- mediates how a foundation-model agent observes a project, acts on it, receives feedback, and establishes that a change is complete. We formalize this substrate as an AI Harness Engineering and identify eleven component responsibilities: task specification, context selection, tool access, project memory, task state, observability, failure attribution, verification, permissions, entropy auditing, and intervention recording. We operationalize the harness through a four-level ladder (H0-H3) that progressively exposes runtime support to the agent, and we propose a trace-based evaluation protocol that converts each agent run into an auditable episode package. Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports. The framework reframes the central question of autonomous software engineering from whether a foundation model can produce a patch to whether the model-harness-environment system can produce a verifiably correct, attributed, and maintainable change. We outline a research program for the runtime systems that foundation-model software agents will require.
