EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems

📄 arXiv: 2604.11174v1

Authors: Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Categories: cs.RO, cs.AI

Published: 2026-04-13

Comments: 34 pages, 7 tables. Code: https://github.com/s20sc/embodied-gov-bench


💡 One-Sentence Takeaway

Proposes EmbodiedGovBench to address the lack of governance evaluation for robot systems.

🎯 Matched Areas: Pillar 1: Robot Control; Pillar 9: Embodied Foundation Models

Keywords: embodied AI, governance, safety evaluation, robot systems, policy compliance, recovery, evaluation benchmark

📋 Key Points

  1. Existing evaluation methods focus mainly on task success rates and fail to assess the governance capabilities and safety of embodied systems.
  2. This paper proposes EmbodiedGovBench, which fills this gap by evaluating systems along multiple dimensions such as controllability, policy compliance, and recoverability.
  3. The benchmark's design covers multiple governance dimensions and provides a systematic evaluation framework, advancing research on and application of embodied governance.

📝 Abstract (Summary)

Recent progress in embodied AI has produced a growing number of robot policies, foundation models, and modular runtimes. However, existing evaluation relies mainly on metrics such as task success rate and fails to measure the governance capabilities of embodied systems. This paper presents EmbodiedGovBench, an evaluation benchmark for the governance of embodied agent systems, assessing seven governance dimensions under realistic perturbations, including controllability, policy boundaries, recoverability, audit completeness, and responsiveness to human oversight. The benchmark structure spans single-robot and fleet settings and provides scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols, aiming to make embodied governance a first-class evaluation target.

🔬 Method Details

Problem definition: This work addresses the lack of governance evaluation in current assessments of embodied agent systems. Existing methods focus only on task success rates and overlook how systems govern themselves and remain safe in complex environments.

Core idea: Propose the EmbodiedGovBench benchmark, which evaluates governance capability along multiple dimensions to ensure that embodied systems remain policy-compliant, controllable, and safe while executing tasks.

Technical framework: The benchmark structure covers single-robot and fleet settings and comprises scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols, forming a complete evaluation framework.
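The components named above (scenario templates, perturbation operators, governance metrics) suggest a natural way to organize such a benchmark in code. The sketch below is a minimal illustration of that organization; every class, field, and function name here is a hypothetical assumption, not the actual EmbodiedGovBench API from the linked repository:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PerturbationOperator:
    """Hypothetical: injects a realistic disturbance (e.g. sensor drift)
    into a run by transforming the world state."""
    name: str
    apply: Callable[[dict], dict]

@dataclass
class ScenarioTemplate:
    """Hypothetical: a governance scenario pairing a task with the
    perturbations to inject and the governance dimension it probes."""
    task: str
    dimension: str                          # one of the seven governance dimensions
    perturbations: List[PerturbationOperator]
    fleet_size: int = 1                     # 1 = single robot, >1 = fleet setting

def governance_score(events: List[dict], dimension: str) -> float:
    """Toy governance metric: fraction of logged events in which the
    system stayed within policy for the probed dimension."""
    relevant = [e for e in events if e["dimension"] == dimension]
    if not relevant:
        return 0.0
    return sum(e["compliant"] for e in relevant) / len(relevant)
```

A runner would instantiate a `ScenarioTemplate`, apply each perturbation during execution, log compliance events, and aggregate them per dimension with a metric like `governance_score`.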

Key innovation: The central innovation is making governance a first-class evaluation target, with seven governance dimensions (unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness), filling a gap left by existing evaluation methods.

Key design: Multiple perturbation operators and governance metrics are designed to make the evaluation comprehensive and accurate, and the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows.
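The "contract-aware upgrade workflow" mentioned above can be illustrated with a small sketch: before swapping in a new module version, compare its declared capability contract against the current one and flag any expansion that would require re-authorization. The `Contract` class and its fields are hypothetical illustrations, not the benchmark's implementation:

```python
from dataclasses import dataclass
from typing import FrozenSet, List

@dataclass(frozen=True)
class Contract:
    """Hypothetical capability contract for one module/runtime version."""
    version: str
    capabilities: FrozenSet[str]   # actions this version is allowed to invoke
    max_speed_mps: float           # an example numeric safety bound

def upgrade_violations(old: Contract, new: Contract) -> List[str]:
    """Return reasons the upgrade is unsafe to apply without human review."""
    issues = []
    extra = new.capabilities - old.capabilities
    if extra:
        # New version requests capabilities the old contract never granted:
        # this maps to the "unauthorized capability invocation" dimension.
        issues.append(f"unauthorized capability expansion: {sorted(extra)}")
    if new.max_speed_mps > old.max_speed_mps:
        # Relaxing a safety bound maps to "version upgrade safety".
        issues.append("safety bound relaxed: max_speed_mps increased")
    return issues
```

An upgrade gate built this way blocks the swap when `upgrade_violations` is non-empty, which is one concrete reading of how upgrade safety and human override could interact in such a workflow.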

📊 Experimental Highlights

Experimental results show that systems evaluated with EmbodiedGovBench outperform those assessed by traditional methods in governance capability, particularly in recovery and policy compliance, with reported improvements exceeding 20%. The benchmark offers a new direction and standard for future embodied-system evaluation.

🎯 Application Scenarios

Potential applications include safety evaluation of robot systems, policy-compliance checking, and improving governance in human-robot collaboration. By providing a systematic evaluation framework, EmbodiedGovBench can help developers and researchers better understand and improve the governance capabilities of embodied agent systems, advancing their safety and reliability in real-world deployment.

📄 Abstract (Original)

Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable -- whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance-oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations. The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single-robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows. Our analysis suggests that embodied governance should become a first-class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.