ActionMap: Robot Policy Learning via Voxel Action Heatmap

Authors: Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou

Categories: cs.RO, cs.CV

Published: 2026-06-05

🔗 Code/Project: GitHub (https://github.com/showlab/ActionMap)

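💡 One-Sentence Takeaway

ActionMap is a voxel heatmap action head that replaces a VLA's native action decoder to improve the accuracy and data efficiency of robot policy learning.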

🎯 Matched Areas: Pillar 1: Robot Control; Pillar 2: RL & Architecture; Pillar 9: Embodied Foundation Models

Keywords: vision-language-action, robot policy learning, voxel heatmap, action decoder, data efficiency, geometric proximity, deep learning

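📋 Key Points

Existing vision-language-action (VLA) models share a weak point in the action decoder: it does not exploit the geometric proximity of neighboring actions in the action space.
The paper proposes ActionMap, which models the action space as a voxel heatmap to improve the accuracy and efficiency of action prediction.
In experiments, ActionMap outperforms existing action decoders on LIBERO simulation and real-world Franka manipulation, and it is markedly more data-efficient when training data is scarce.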

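📝 Abstract (Summary)

Vision-language-action (VLA) models have advanced rapidly in backbones, training recipes, and data scale, yet the action decoder has barely changed and remains a single-point predictor. Existing decoders, whether built on autoregressive token bins, L1 regression, or flow-matching denoising, leave the geometric proximity of neighboring actions unexploited. To address this, the paper proposes ActionMap, a voxel heatmap action head that replaces the native action decoder of an existing VLA. On LIBERO simulation and real-world Franka manipulation, the heatmap head outperforms the baselines on two architecturally distinct backbones at matched training steps and is markedly more data-efficient at low training data.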

🔬 Method Details

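Problem definition: The paper targets the limitations of the action decoder in existing VLA models. Current decoders typically treat the action space as unstructured and do not exploit the geometric relationships between actions.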
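
To make the "unstructured action space" point concrete, here is a minimal sketch that compares a one-hot target with a Gaussian-smoothed target over 1-D action bins. It is illustrative only: the bin count, `sigma`, the `peaked_logits` helper, and the use of cross-entropy are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array."""
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return float(-(target * log_softmax(logits)).sum())

def one_hot(index, n):
    t = np.zeros(n)
    t[index] = 1.0
    return t

def gaussian_target(index, n, sigma=1.0):
    """Soft target that spreads probability onto bins near `index` (assumed form)."""
    k = np.arange(n)
    t = np.exp(-((k - index) ** 2) / (2.0 * sigma ** 2))
    return t / t.sum()

def peaked_logits(index, n, height=5.0):
    """Hypothetical prediction: all mass concentrated around one bin."""
    logits = np.zeros(n)
    logits[index] = height
    return logits

n, true_bin = 32, 10
near = peaked_logits(11, n)   # off by one bin
far = peaked_logits(25, n)    # off by fifteen bins

hard_near = cross_entropy(near, one_hot(true_bin, n))
hard_far = cross_entropy(far, one_hot(true_bin, n))
soft_near = cross_entropy(near, gaussian_target(true_bin, n))
soft_far = cross_entropy(far, gaussian_target(true_bin, n))

print(f"one-hot target : near={hard_near:.3f} far={hard_far:.3f}")
print(f"gaussian target: near={soft_near:.3f} far={soft_far:.3f}")
```

With the one-hot target, the near miss and the far miss incur the same loss, so the training signal carries no notion of distance. With the smoothed target, the near miss is penalized less than the far one.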

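Core idea: For each new action, ActionMap predicts a voxel heatmap over the action space in which each voxel directly stores the probability of the corresponding action, so that the relationships between neighboring actions are captured.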
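
A heatmap in which each voxel stores an action probability can be sketched as follows. This is a rough illustration under stated assumptions: a 3-D action normalized to [-1, 1] per axis, a 16-voxel grid per axis, and a Gaussian-shaped target with `sigma=0.15`. None of these values, nor the helper names, come from the paper.

```python
import numpy as np

def voxel_centers(resolution, low=-1.0, high=1.0):
    """Centers of `resolution` equal-width bins spanning [low, high]."""
    edges = np.linspace(low, high, resolution + 1)
    return 0.5 * (edges[:-1] + edges[1:])

def gaussian_heatmap_target(action, resolution=16, sigma=0.15):
    """Probability grid of shape (R, R, R) peaked at the voxel nearest `action`."""
    centers = voxel_centers(resolution)
    # Per-axis squared distance from every voxel center to the action.
    d2 = [(centers - a) ** 2 for a in action]
    # Broadcast the three 1-D arrays into a full 3-D grid of squared distances.
    dist2 = d2[0][:, None, None] + d2[1][None, :, None] + d2[2][None, None, :]
    heat = np.exp(-dist2 / (2.0 * sigma ** 2))
    return heat / heat.sum()  # normalize so the voxels sum to 1

action = np.array([0.3, -0.45, 0.05])  # hypothetical normalized action
target = gaussian_heatmap_target(action)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(target), target.shape))
print("grid shape:", target.shape, "peak voxel:", peak)
```

Voxels adjacent to the peak receive more probability than distant ones, which is the kind of geometric proximity a single-point target cannot express.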

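Technical framework: The architecture centers on a new voxel heatmap action head that drops into an existing VLA model in place of its action decoder. The head generates a voxel heatmap from the backbone's hidden state, and the continuous control signal is obtained from that heatmap.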
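
The path from hidden state to heatmap to continuous action might look like the sketch below. The single linear projection (`heatmap_head`), the hidden size, the grid resolution, and both decoding rules (argmax voxel center and probability-weighted mean) are assumptions for illustration. This summary does not say how the paper's head is built or how it decodes the heatmap.

```python
import numpy as np

def voxel_centers(resolution, low=-1.0, high=1.0):
    """Centers of `resolution` equal-width bins spanning [low, high]."""
    edges = np.linspace(low, high, resolution + 1)
    return 0.5 * (edges[:-1] + edges[1:])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def heatmap_head(hidden, weight, resolution):
    """Hypothetical head: one linear map from hidden state to R^3 voxel logits."""
    return (hidden @ weight).reshape(resolution, resolution, resolution)

def decode_action(logits, centers):
    """Return (argmax voxel center, probability-weighted mean) as 3-D actions."""
    prob = softmax(logits.ravel()).reshape(logits.shape)
    idx = np.unravel_index(np.argmax(prob), prob.shape)
    hard = centers[list(idx)]
    # Marginalize the grid onto each axis, then take the expected center.
    marginals = [prob.sum(axis=(1, 2)), prob.sum(axis=(0, 2)), prob.sum(axis=(0, 1))]
    soft = np.array([(m * centers).sum() for m in marginals])
    return hard, soft

R, D = 8, 64                                   # assumed resolution and hidden size
rng = np.random.default_rng(0)
hidden = rng.normal(size=D)                    # stand-in for a backbone hidden state
weight = rng.normal(size=(D, R ** 3)) * 0.1    # untrained, random projection
centers = voxel_centers(R)

hard_action, soft_action = decode_action(heatmap_head(hidden, weight, R), centers)
print("argmax action  :", hard_action)
print("expected action:", soft_action)
```

The argmax rule is limited to voxel centers, while the weighted mean can return values between centers. Which rule suits a given controller is a design choice this sketch does not settle.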

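Key innovation: The main contribution is the voxel heatmap action head, which gives the action representation an explicit structure. Compared with conventional single-point predictors, it makes better use of the geometric information in the action space.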

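Key design: The heatmap resolution, the choice of loss function, and adjustments to the network structure are the key design factors, chosen so that the model stays efficient and accurate across training conditions.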
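
One way to reason about heatmap resolution is the trade-off between voxel count and quantization error. The sketch below works through that arithmetic for a uniform grid. The 3-D action space, the [-1, 1] range, and the candidate resolutions are assumptions, since this summary does not state the values the paper uses.

```python
import numpy as np

def grid_cost(resolution, dims=3, low=-1.0, high=1.0):
    """Return (number of voxels, worst-case per-axis error) for a uniform grid."""
    width = (high - low) / resolution
    return resolution ** dims, width / 2.0

def empirical_max_error(resolution, n_samples=10_000, low=-1.0, high=1.0, seed=0):
    """Snap random 1-D actions to their voxel center and report the largest error."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(low, high, size=n_samples)
    width = (high - low) / resolution
    idx = np.clip(np.floor((actions - low) / width).astype(int), 0, resolution - 1)
    snapped = low + (idx + 0.5) * width
    return float(np.abs(actions - snapped).max())

for r in (8, 16, 32):
    voxels, bound = grid_cost(r)
    print(f"R={r:2d} voxels={voxels:6d} bound={bound:.4f} observed={empirical_max_error(r):.4f}")
```

Doubling the resolution halves the worst-case per-axis error but multiplies the number of voxels by eight in three dimensions, so the output layer grows much faster than the precision improves.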

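📊 Experimental Highlights

ActionMap improves on OpenVLA-OFT's L1 regression head by 8.2% on the LIBERO four-suite average at matched training steps. The gains are consistent across two architecturally distinct backbones, and the method is more data-efficient at low training data, which indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling.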

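🎯 Application Scenarios

Potential applications include robot manipulation, automated control, and human-robot interaction. By improving action prediction for robots in complex environments, ActionMap could benefit settings such as smart manufacturing and service robots.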

📄 Abstract (Original)

Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://github.com/showlab/ActionMap.
