PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

Authors: Chong Ma, Taiyi Su, Jian Zhu, Jianjun Zhang, Zitai Huang, Yi Xu, Hanli Wang

Category: cs.RO

Published: 2026-06-04

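💡 One-Sentence Summary

Proposes PiL-World, a chunk-wise world model that enables closed-loop (policy-in-the-loop) evaluation of VLA policies.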

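🎯 Matched areas: Pillar 1: Robot Control; Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: vision-language-action, closed-loop evaluation, world model, robot manipulation, multi-view observation, action-guided generation, task execution context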

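📋 Key Points

Most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories, so they cannot support closed-loop vision-language-action (VLA) evaluation.
PiL-World generates multi-view future observations that are consistent with the VLA policy's rollout and match the image inputs the policy requires, which makes closed-loop evaluation possible without real robot execution at every step.
On three real dual-arm manipulation tasks, PiL-World reduces the error between real and world-model-estimated VLA success rates from 63.2% to 12.0% compared with the baseline.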

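📝 Abstract (Summary)

Vision-language-action (VLA) policies run in a closed loop in real-world robot tasks: the robot observes the scene, executes an action chunk, and conditions its next decision on the observation that results from that execution. Most existing world models, however, are limited to open-loop prediction and cannot support closed-loop VLA evaluation. This paper proposes PiL-World, a chunk-wise world model designed for VLA evaluation. Given the current observation and the action trajectory produced by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout. By alternating between VLA inference and world-model prediction, it enables closed-loop evaluation without real robot execution at every step. On three real dual-arm manipulation tasks, the imagined rollouts generated by PiL-World are highly consistent with real executions, and the success-rate error drops from 63.2% to 12.0%.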

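🔬 Method Details

Problem definition: The paper addresses the inability of existing world models for robot action evaluation to support closed-loop VLA evaluation. Existing methods typically perform open-loop prediction along pre-collected action trajectories, so later action chunks are never conditioned on the observations generated by earlier ones.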

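Core idea: PiL-World generates multi-view future observations that are consistent with the VLA policy's rollout, and each prediction becomes the policy's next input. Alternating between VLA inference and world-model prediction removes the need to execute every step on a real robot.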
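The alternation described above can be sketched as a short loop. This is a minimal illustration and not the authors' implementation: `closed_loop_rollout`, `toy_policy`, and `toy_world_model` are hypothetical names, observations are reduced to one brightness value per camera view, and the chunk length of 4 is an arbitrary choice.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: camera view name -> a single "brightness" value.
Observation = Dict[str, float]
ActionChunk = List[float]


def closed_loop_rollout(
    policy: Callable[[Observation], ActionChunk],
    world_model: Callable[[Observation, ActionChunk], List[Observation]],
    first_obs: Observation,
    max_chunks: int,
) -> List[Observation]:
    """Alternate policy inference and world-model prediction."""
    obs = first_obs
    trajectory = [first_obs]
    for _ in range(max_chunks):
        # The policy only ever sees the latest (imagined) observation.
        action_chunk = policy(obs)
        # The world model predicts one multi-view observation per action.
        predicted = world_model(obs, action_chunk)
        trajectory.extend(predicted)
        # Closing the loop: the last prediction conditions the next chunk.
        obs = predicted[-1]
    return trajectory


def toy_policy(obs: Observation) -> ActionChunk:
    # Assumed toy rule: keep moving until the head view is "bright enough".
    step = 1.0 if obs["head"] < 5.0 else 0.0
    return [step] * 4


def toy_world_model(obs: Observation, chunk: ActionChunk) -> List[Observation]:
    # Assumed toy dynamics: every view brightens by the action value.
    frames = []
    current = obs
    for action in chunk:
        current = {view: value + action for view, value in current.items()}
        frames.append(current)
    return frames


start = {"head": 0.0, "left_wrist": 0.0, "right_wrist": 0.0}
rollout = closed_loop_rollout(toy_policy, toy_world_model, start, max_chunks=3)
print(len(rollout), rollout[-1]["head"])
```

In this toy setup the third chunk contains only zero actions because the policy reacts to the observation imagined after the second chunk. An open-loop replay of a pre-collected trajectory could not show that reaction.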

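Technical framework: The overall architecture covers three parts: observation input, the action trajectory rolled out by the VLA policy, and multi-view observation prediction. The model takes the current observation and the policy's action trajectory as input, then generates the future multi-view observations.

Key innovation: PiL-World generates future observations chunk by chunk, conditioned on action-derived visual control and on latent histories. It also learns from failed execution trajectories in addition to successful teleoperated demonstrations, which helps the imagined rollouts better match the distribution of real policy executions.

Key design: Video generation is conditioned on action-derived visual control from head-view robot motion and on latent histories that encode task execution context, while complementary multi-view observations are predicted jointly. The abstract does not specify the loss function or network architecture; see the original paper for those details.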

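📊 Experimental Highlights

On three real dual-arm manipulation tasks, the imagined rollouts generated by PiL-World are highly consistent with real robot executions. Compared with the baseline, the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation falls from 63.2% to 12.0%.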
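To make the reported metric concrete, the sketch below computes a success-rate error between real and imagined rollouts. The abstract does not define the metric precisely, so the mean absolute difference across tasks used here is an assumption. The task names and outcome lists are invented placeholders, not results from the paper.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Percentage of successful episodes."""
    return 100.0 * sum(outcomes) / len(outcomes)


def success_rate_error(
    real: Dict[str, List[bool]],
    imagined: Dict[str, List[bool]],
) -> float:
    """Mean absolute gap (percentage points) between real and imagined
    success rates, averaged over tasks. Assumed definition."""
    gaps = [
        abs(success_rate(real[task]) - success_rate(imagined[task]))
        for task in real
    ]
    return sum(gaps) / len(gaps)


# Placeholder episode outcomes for two hypothetical tasks.
real_outcomes = {
    "task_a": [True] * 8 + [False] * 2,   # 80% real success
    "task_b": [True] * 5 + [False] * 5,   # 50% real success
}
imagined_outcomes = {
    "task_a": [True] * 7 + [False] * 3,   # 70% estimated success
    "task_b": [True] * 6 + [False] * 4,   # 60% estimated success
}

error = success_rate_error(real_outcomes, imagined_outcomes)
print(f"success-rate error: {error:.1f} percentage points")
```

A lower value means the world-model evaluation ranks and scores policies more like real-robot trials would, which is what the drop from 63.2% to 12.0% expresses.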

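🎯 Application Scenarios

Potential application areas include robot manipulation, automated manufacturing, and smart homes. More accurate evaluation of VLA policies could give more reliable support for autonomous robot decision-making in complex environments.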

📄 Abstract (Original)

Vision-language-action (VLA) policies operate in a closed loop in real-world robot tasks: a robot observes the scene, executes an action chunk, and conditions its next decision on the resulting observation. However, most existing world models for robot action evaluation are limited to open-loop prediction along pre-collected action trajectories. This prevents them from supporting closed-loop VLA evaluation, where each action chunk must be conditioned on the observation generated by the previous execution. To address this gap, we propose PiL-World, a chunk-wise world model designed for policy-in-the-loop VLA evaluation. Given the current observation and the action trajectory rolled out by a VLA policy, PiL-World generates multi-view future observations that are consistent with the VLA rollout and match the image inputs required by the policy. By alternating between VLA inference and world-model prediction, PiL-World enables closed-loop evaluation without real robot execution at every step. To improve rollout fidelity, PiL-World conditions video generation on action-derived visual control from head-view robot motion and latent histories that encode task execution context, while jointly predicting complementary multi-view observations. Beyond successful teleoperated demonstrations, it also learns from failed execution trajectories, helping the imagined rollouts better match the distribution of real policy executions. We evaluate PiL-World on three real dual-arm manipulation tasks. PiL-World generates imagined rollouts that are highly consistent with real robot executions. More importantly, compared with the baseline, it reduces the error between VLA success rates measured in real-world rollouts and those estimated through closed-loop world-model evaluation from 63.2% to 12.0%.
