VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

Authors: Guiyu Zhao, Longteng Guo, Junyou Zhu, Jun Fu, Yanghong Mei, Bin Cao, Jie Jiang, Xingjian He, Jing Liu

Category: cs.RO

Published: 2026-06-09

Comments: Submitted to ACM MM

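💡 One-sentence takeaway

Proposes VeriSpace, a 3D-aware action verifier for test-time action selection in VLA models.

🎯 Matched areas: Pillar 1: Robot Control; Pillar 9: Embodied Foundation Models

Keywords: vision-language-action, action verification, 3D perception, spatial reasoning, robotic manipulation, deep learning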

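📋 Key points

Existing VLA models rely on one-shot action prediction at test time, so even small action errors can have serious consequences.
VeriSpace verifies candidate actions using 3D scene encoding and spatial reasoning, making action selection more reliable.
Experiments show that VeriSpace improves decision reliability on public benchmarks and real-world manipulation tasks, outperforming both the underlying VLA policies and prior verification-based methods.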

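📝 Summary (translated from the Chinese)

Vision-language-action (VLA) models show strong promise for robotic manipulation, but their test-time reliability is limited by one-shot action prediction, so small action errors can cause grasp failure, collision, or incorrect task progression. This paper proposes VeriSpace, a 3D-aware action verifier for test-time action selection. VeriSpace evaluates candidate actions through two key components, Dual-Path 3D-Injected Scene Encoding and Spatially-Grounded Action Reasoning, making the choice among candidate actions more reliable while remaining compatible with existing VLA policies. Experiments show that VeriSpace consistently improves decision reliability on public benchmarks and real-world robotic manipulation tasks.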

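🔬 Method details

Problem definition: The paper addresses action verification for vision-language-action (VLA) models at test time. Existing approaches rely on one-shot action prediction, so even a small action error can cause grasp failure, collision, or incorrect task progression. Test-time verification instead proposes several candidate actions and evaluates them before one is executed.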
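The propose-then-verify loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_action`, `propose`, `score`, and `num_candidates` are hypothetical names, the policy and verifier are plain callables, and the toy verifier below simply prefers actions near a known target.

```python
import random
from typing import Callable, Sequence

# Hypothetical action type: a short vector such as an end-effector command.
Action = Sequence[float]


def select_action(
    propose: Callable[[], Action],
    score: Callable[[Action], float],
    num_candidates: int = 8,
) -> Action:
    """Sample candidates from a policy and return the one the verifier scores highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [propose() for _ in range(num_candidates)]
    return max(candidates, key=score)


# Toy usage: a noisy "policy" around a target and a "verifier" that
# penalizes squared distance to that target (both are stand-ins).
rng = random.Random(0)
target = (0.5, 0.2, 0.1)


def toy_policy() -> Action:
    return tuple(t + rng.gauss(0.0, 0.05) for t in target)


def toy_verifier(action: Action) -> float:
    return -sum((a - t) ** 2 for a, t in zip(action, target))


best = select_action(toy_policy, toy_verifier, num_candidates=16)
```

Because the selection step only needs a sampler and a scoring function, a verifier of this shape can sit on top of an existing policy without retraining it, which is the compatibility property the abstract emphasizes.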

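Core idea: VeriSpace makes action verification more reliable by adding 3D perception and spatial reasoning. It builds a scene representation that jointly preserves visual semantics and explicit 3D geometry, which lets it judge candidate actions more accurately.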

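Framework: VeriSpace has two main modules. Dual-Path 3D-Injected Scene Encoding builds the scene representation. Spatially-Grounded Action Reasoning then evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress.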
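To make the scoring criteria concrete, here is a hand-written toy scorer that combines a geometric validity check with a goal-progress term. It is only a sketch under strong assumptions: VeriSpace's verifier is a learned model, whereas `Scene`, `verify`, and the `clearance` threshold here are hypothetical stand-ins, with obstacle points playing the role of explicit 3D geometry.

```python
import math
from dataclasses import dataclass

Point = tuple[float, float, float]


@dataclass
class Scene:
    obstacles: list[Point]  # hypothetical stand-in for explicit 3D geometry
    goal: Point             # hypothetical stand-in for the task goal
    gripper: Point          # current end-effector position


def verify(scene: Scene, target: Point, clearance: float = 0.05) -> float:
    """Score a candidate end-effector target; higher is better."""
    # Geometric validity: reject targets that come too close to any obstacle.
    nearest = min((math.dist(target, o) for o in scene.obstacles), default=math.inf)
    if nearest < clearance:
        return -math.inf
    # Goal progress: how much closer to the goal this action brings the gripper.
    return math.dist(scene.gripper, scene.goal) - math.dist(target, scene.goal)


scene = Scene(
    obstacles=[(0.3, 0.0, 0.0)],
    goal=(0.5, 0.0, 0.0),
    gripper=(0.0, 0.0, 0.0),
)
candidates = [(0.29, 0.0, 0.0), (0.2, 0.1, 0.0), (0.45, 0.0, 0.0)]
best = max(candidates, key=lambda c: verify(scene, c))
```

The first candidate lies almost on top of the obstacle and is rejected outright, while the other two are ranked by progress toward the goal. This mirrors the point that a verifier must separate candidates that differ only slightly in geometry but lead to very different outcomes.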

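Key innovation: The main novelty is the combination of Dual-Path 3D-Injected Scene Encoding with spatially grounded reasoning, so that verification accounts for progress toward the task goal as well as subtle geometric differences between candidates. This contrasts with verification based on geometry alone.

Key design: VeriSpace is trained with a loss aimed at action-selection accuracy and uses a deep network for scene encoding and reasoning. The abstract does not specify the loss function or the network architecture.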

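📊 Experimental highlights

VeriSpace performs well on multiple public benchmarks, improving decision reliability over both the underlying VLA policies and prior verification-based methods, with the largest gains in out-of-distribution settings. The source leaves the figures as placeholders: a gain of XX% in out-of-distribution settings and an XX% higher success rate on complex tasks.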

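🎯 Application scenarios

VeriSpace has potential applications in robotic manipulation, automated manufacturing, and smart homes. More reliable VLA decisions can reduce execution errors and help robots adapt to complex environments, which supports the practical deployment of intelligent robots.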

📄 Abstract (original)

Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and evaluated before execution. However, reliable action verification is challenging because it requires not only distinguishing subtle geometric differences between candidate actions, but also assessing whether an action makes meaningful progress toward the task goal. We present VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems. VeriSpace evaluates candidate actions through two key components: Dual-Path 3D-Injected Scene Encoding, which constructs a scene representation that jointly preserves visual semantics and explicit 3D geometry, and Spatially-Grounded Action Reasoning, which evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress. Together, these components enable more reliable discrimination between subtle yet outcome-critical action candidates while remaining fully compatible with existing VLA policies. Experiments on public benchmarks and real-world robotic manipulation tasks show that VeriSpace consistently improves decision reliability over both underlying VLA policies and prior verification-based methods, yielding substantial gains in both in-distribution and out-of-distribution settings.
