$ω$-EVA: Envision, Verify, and Act with Latent Interactive World Models

Authors: Zhenguo Sun, Yu Sun, Hande Huang, Alois Knoll

Category: cs.RO

Published: 2026-06-08

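💡 One-Sentence Takeaway

Proposes $ω$-EVA, a latent interactive world model that lets an embodied policy inspect the imagined consequence of its own proposed action before acting.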

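🎯 Matched Areas: Pillar 1: Robot Control; Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: embodied policy, world model, action generation, latent dynamics, decision transparency, robot control, human-robot interaction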

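📋 Key Points

Existing embodied policies usually map observations directly to actions and leave the consequences of candidate actions implicit, so the policy never checks what its proposal would lead to.
$ω$-EVA uses an Envision-Verify-Act loop that lets the policy inspect the imagined consequence of its proposal before execution.
Across a range of simulation settings, the complete interaction pipeline consistently improves the proposal policy, and latent diagnostics indicate meaningful action-conditioned future structure.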

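📝 Abstract (translated from Chinese)

Embodied policies typically map the current observation directly to actions and leave the consequences of candidate actions implicit. World models provide predictive supervision and representations, but they rarely let a policy inspect the imagined consequence of its own proposal before acting. This paper introduces $ω$-EVA, a latent interactive world model that realizes an Envision-Verify-Act loop for embodied action generation. The framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy, and feeds the policy's proposal back through the world model. Experiments show that the complete interaction pipeline consistently improves the proposal policy, and latent diagnostics indicate meaningful action-conditioned future structure. Without additional robot-data pretraining, $ω$-EVA shows a compact and competitive performance-scale-data trade-off.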

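🔬 Method Details

Problem definition: Embodied policies do not make the consequences of candidate actions explicit during action generation, so existing methods do little reasoning about what a proposed action will lead to.

Core idea: $ω$-EVA introduces an Envision-Verify-Act loop so that the policy can evaluate the consequence of its proposal before executing it, which improves both the effectiveness and the transparency of the decision.

Technical framework: The method has three stages: (1) learn action-conditioned latent dynamics, (2) train a language-conditioned flow policy on dynamics-aware visual representations, and (3) feed the policy's proposal back through the world model.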
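The second stage refers to a flow policy. As a rough illustration of how a flow-based policy can turn noise into an action chunk, the sketch below integrates a velocity field with Euler steps. It is a generic toy, not the authors' implementation: `toy_velocity`, `sample_action_chunk`, and the `target` argument are hypothetical names, and `target` stands in for whatever a trained network would infer from the observation and language instruction.

```python
import random


def toy_velocity(action, t, target):
    """Hypothetical stand-in for a learned velocity field.

    On a straight-line flow from noise to `target`, the velocity that
    reaches `target` at t = 1 from the current point is
    (target - action) / (1 - t). A real policy would predict this with
    a network conditioned on vision and language.
    """
    return [(g - a) / (1.0 - t) for a, g in zip(action, target)]


def sample_action_chunk(target, steps=10, seed=0):
    """Integrate the velocity field from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    # Start from Gaussian noise, one value per action dimension.
    action = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(action, t, target)
        action = [a + dt * vi for a, vi in zip(action, v)]
    return action


if __name__ == "__main__":
    print(sample_action_chunk([0.2, -0.4, 0.1]))
```

Because the toy velocity field is exact for a straight-line path, the sampled chunk lands on `target` up to floating-point error; a learned field would only approximate this.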

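Key innovation: $ω$-EVA keeps consequence reasoning in latent feature space, so it does not need to generate future video at inference time, unlike approaches that imagine the future in pixel space.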
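The control flow of an Envision-Verify-Act step can be sketched in a few lines. The code below is a minimal sketch under stated assumptions, not the paper's architecture: every function name is hypothetical, the "latents" are plain lists of floats, and the toy refiner simply shrinks the proposal so that the imagined latent future stays within a bound.

```python
from typing import Callable, List

Vector = List[float]


def envision_verify_act(
    encode: Callable[[Vector], Vector],
    propose: Callable[[Vector, str], Vector],
    predict_latent: Callable[[Vector, Vector], Vector],
    refine: Callable[[Vector, Vector, Vector], Vector],
    observation: Vector,
    instruction: str,
) -> Vector:
    """One step of the loop; all four components are assumed callables."""
    z_now = encode(observation)                 # current latent state
    proposal = propose(z_now, instruction)      # policy's proposed action
    z_future = predict_latent(z_now, proposal)  # Envision: imagined latent outcome
    # Verify: reason over state, imagined future, and proposal together,
    # then return the final action to Act on. No video is decoded.
    return refine(z_now, z_future, proposal)


# --- Toy components, purely illustrative ---
def toy_encode(obs: Vector) -> Vector:
    return list(obs)


def toy_propose(z: Vector, instruction: str) -> Vector:
    return [1.0 for _ in z]  # always proposes a unit step


def toy_dynamics(z: Vector, action: Vector) -> Vector:
    return [zi + ai for zi, ai in zip(z, action)]


def toy_refine(z: Vector, z_future: Vector, action: Vector, limit: float = 2.0) -> Vector:
    # Clamp the imagined future into [-limit, limit] and return the
    # action that would reach the clamped future from z.
    return [max(-limit, min(limit, zf)) - zi for zi, zf in zip(z, z_future)]


if __name__ == "__main__":
    final = envision_verify_act(
        toy_encode, toy_propose, toy_dynamics, toy_refine, [0.5, 1.5], "reach"
    )
    print(final)
```

In this toy run the second dimension's proposal would push the imagined latent past the bound, so the refiner shortens that component while leaving the first unchanged.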

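Key design: A tri-branch refiner jointly reasons over the current state, the proposal-conditioned future, and the proposed action to produce the final action chunk. The model has approximately 1.2B parameters and uses no additional robot-data pretraining, giving a compact and competitive performance-scale-data trade-off.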

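📊 Experimental Highlights

Across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings, the complete interaction pipeline consistently improves on the proposal policy alone. Latent diagnostics further indicate that the learned latent futures carry meaningful action-conditioned structure.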

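🎯 Application Scenarios

$ω$-EVA is relevant to robot control, automated task execution, and human-robot interaction. By making the consequence of a proposed action available before it is executed, the approach could improve a robot's autonomy and adaptability in complex environments.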

📄 Abstract (Original)

Embodied policies typically map current observations directly to actions, leaving candidate-action consequences implicit. World models provide predictive supervision, representations, or external simulation, but rarely let a policy inspect the imagined consequence of its own proposal before acting. We introduce $ω$-EVA, a latent interactive world model that realizes an Envision-Verify-Act loop for embodied action generation. Its three-stage framework learns action-conditioned latent dynamics, trains a language-conditioned flow policy on dynamics-aware visual representations, and feeds the policy's proposal back through the world model. A tri-branch refiner jointly reasons over the current state, proposal-conditioned future, and proposed action to produce the final action chunk. Because consequence reasoning remains in latent feature space, $ω$-EVA avoids generating future videos at inference. Evaluations across diverse single-arm, bimanual, long-horizon, and perturbed simulation settings show that the complete interaction pipeline consistently improves the proposal policy, while latent diagnostics indicate meaningful action-conditioned future structure. With approximately 1.2B parameters and no additional robot-data pretraining, $ω$-EVA demonstrates a compact and competitive performance-scale-data trade-off, making the world model an active action-feedback module rather than a passive predictor.
