TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Authors: Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding

Categories: cs.RO, cs.AI

Published: 2026-06-04

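💡 One-Sentence Takeaway

Proposes TempoVLA, a single vision-language-action policy whose execution speed in robot manipulation is set by an explicit condition.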

🎯 Matched Areas: Pillar 1: Robot Control; Pillar 2: RL Algorithms and Architecture; Pillar 9: Embodied Foundation Models

Keywords: robot manipulation, speed control, vision-language-action, variable speed, multimodal models

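📋 Key Points

Existing vision-language-action models execute at the single fixed speed inherited from their training demonstrations, so they cannot adapt speed to phases with different risk levels.
TempoVLA combines Variable-Speed Trajectory Augmentation on the data side with a speed-conditioning mechanism on the model side to make execution speed controllable.
Experiments show flexible speed control in both directions (faster and slower), and VSTA additionally improves performance at the default $1\times$ speed through better data utilization.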

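📝 Abstract (Summary)

Robot manipulation calls for fast execution in low-risk transit phases and slow, precise motion in high-risk contact phases. Existing vision-language-action models (VLAs), however, inherit a single fixed speed from their training demonstrations, and deceleration in particular remains largely unaddressed. This paper proposes TempoVLA, which makes execution speed controllable by combining data-side Variable-Speed Trajectory Augmentation (VSTA) with a model-side speed-conditioning mechanism, so the robot can adapt its speed to the risk level of each phase. Experiments in simulation and on real-world tasks show that TempoVLA achieves flexible speed control and also improves performance at the default speed.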

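🔬 Method Details

Problem definition: The paper addresses the lack of speed control in existing vision-language-action models for robot manipulation, in particular the inability to vary execution speed between high-risk and low-risk phases. Existing methods execute at one fixed speed and leave deceleration almost unexplored.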

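Core idea: The magnitude of each predicted action already determines how fast the robot moves, which gives a direct route to controllable execution speed. TempoVLA exploits this by coupling data-side Variable-Speed Trajectory Augmentation (VSTA) with a model-side speed-conditioning mechanism, so that one policy can be asked to move faster or slower.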
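The link between action magnitude and speed can be illustrated with a small worked example. This is only a sketch under stated assumptions: actions are taken to be per-step end-effector position deltas in metres, the controller runs at a fixed rate, and the function name `mean_speed` and the 10 Hz rate are hypothetical, not taken from the paper.

```python
import math

def mean_speed(delta_actions, control_hz):
    """Average speed (m/s) implied by per-step position deltas at a fixed control rate."""
    # Length of each step: the Euclidean norm of the delta action.
    step_lengths = [math.sqrt(sum(c * c for c in a)) for a in delta_actions]
    # Distance per step times steps per second gives distance per second.
    return sum(step_lengths) / len(step_lengths) * control_hz

# Hypothetical demonstration: four 1 cm steps along x.
demo = [(0.01, 0.0, 0.0)] * 4
# The same 4 cm path covered in two 2 cm steps.
fast = [(0.02, 0.0, 0.0)] * 2

print(mean_speed(demo, control_hz=10))  # about 0.1 m/s
print(mean_speed(fast, control_hz=10))  # about 0.2 m/s, i.e. twice as fast
```

With the control rate held fixed, doubling the magnitude of each action halves the number of steps needed for the same path, which is why controlling action magnitude amounts to controlling execution speed.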

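Framework: TempoVLA consists of two coupled components: the data-side VSTA module and the model-side speed-conditioning module. VSTA re-times demonstration data to a target speed, and the conditioning module feeds that speed to the policy to guide execution.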
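How the two components fit together can be sketched as a data record plus a conditioning step. The source does not say how the speed is injected into the policy, so the text-based conditioning below is purely an assumption for illustration (a learned embedding would be another option); `SpeedConditionedSample` and `condition_prompt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeedConditionedSample:
    """One training sample: a re-timed action chunk paired with its speed label."""
    instruction: str
    speed: float                # target speed multiplier; 1.0 = demonstration speed
    actions: List[List[float]]  # actions already re-timed to `speed` by augmentation

def condition_prompt(instruction: str, speed: float) -> str:
    """Assumed conditioning scheme: append the speed multiplier to the instruction."""
    return f"{instruction} [speed: {speed:.2f}x]"

sample = SpeedConditionedSample(
    instruction="pick up the cup",
    speed=0.5,
    actions=[[0.005, 0.0, 0.0], [0.005, 0.0, 0.0]],
)
print(condition_prompt(sample.instruction, sample.speed))
```

The point of the pairing is that the label the policy sees at training time matches the speed of the re-timed actions it is asked to predict, so the same label can be used as a control knob at inference time.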

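Key innovation: The main contribution is Variable-Speed Trajectory Augmentation (VSTA), which lets the robot execute a task at different speeds while preserving the motion semantics of the demonstration. This differs from prior approaches, which either run at the demonstration speed or only shift the policy from one fixed speed to another.

Key design: VSTA reaches a target speed by merging actions (to speed up) or splitting them (to slow down), leaving the motion semantics unchanged. The model-side conditioning mechanism feeds the requested speed to the policy as an explicit condition, so a single model can execute at the commanded speed.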
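The merge-or-split idea can be sketched as resampling the cumulative path of a demonstration. This is a minimal illustration, not the paper's procedure: it assumes actions are position deltas, uses linear interpolation, and ignores rotations, gripper commands, and paired observations, all of which a real implementation would need to handle. The name `retime` is hypothetical.

```python
import math

def retime(deltas, speed):
    """Re-time delta actions to `speed`x by resampling the cumulative path.

    speed > 1 merges consecutive actions; speed < 1 splits each action.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not deltas:
        return []
    n, dim = len(deltas), len(deltas[0])
    # path[i] is the total displacement after i original steps.
    path = [[0.0] * dim]
    for d in deltas:
        path.append([p + x for p, x in zip(path[-1], d)])

    def at(t):
        # Linearly interpolate the path at original-step time t, clamped to the end.
        t = min(t, float(n))
        i = min(int(t), n - 1)
        frac = t - i
        return [a + frac * (b - a) for a, b in zip(path[i], path[i + 1])]

    n_new = math.ceil(n / speed - 1e-9)  # small epsilon guards against float noise
    out, prev = [], path[0]
    for j in range(1, n_new + 1):
        cur = at(j * speed)
        out.append([c - p for c, p in zip(cur, prev)])
        prev = cur
    return out

demo = [[0.01, 0.0], [0.02, 0.01], [0.0, 0.03], [0.01, 0.0]]
fast = retime(demo, 2.0)   # 2 actions, each the sum of two original actions
slow = retime(demo, 0.5)   # 8 actions, each half of an original action
```

In this sketch the total displacement is the same at every speed, which is one simple way to read "preserving motion semantics"; for speeds that do not divide the trajectory evenly, the final action is shorter than the others.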

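📊 Experimental Highlights

Experiments show that TempoVLA controls speed in both directions, accelerating through low-risk phases and decelerating in high-risk ones. In addition, VSTA improves performance at the default $1\times$ speed through better data utilization.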

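🎯 Application Scenarios

TempoVLA is relevant to manipulation tasks that switch between fast and precise motion, such as industrial automation, service robots, and medical robots. Dynamic speed control could improve a robot's adaptability and efficiency in complex environments, with potential impact on areas such as smart manufacturing and human-robot collaboration.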
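Dynamic speed control of this kind can be sketched as a mapping from a per-phase risk label to a speed condition. The original abstract states that the risk-aware switching comes from cooperating with a large multimodal model; here that model is left out entirely, and the labels, phase names, and multipliers (2.0 and 0.5) are hypothetical values chosen only for illustration.

```python
# Hypothetical speed multipliers per risk label; not values from the paper.
SPEED_BY_RISK = {"low": 2.0, "high": 0.5}

def choose_speed(risk_label, default=1.0):
    """Map a risk label to a speed condition, falling back to the default speed."""
    return SPEED_BY_RISK.get(risk_label, default)

def schedule(phases):
    """Assign a speed condition to each (phase_name, risk_label) pair."""
    return [(name, choose_speed(risk)) for name, risk in phases]

# Risk labels would come from an external multimodal model in practice.
plan = schedule([("reach", "low"), ("insert", "high"), ("retreat", "unknown")])
print(plan)
```

Falling back to the default speed for unrecognised labels is a conservative design choice in this sketch: the policy then behaves as it would at demonstration speed.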

📄 Abstract (Original)

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
