Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

📄 arXiv: 2605.16241v1 📥 PDF

Authors: Jin Shi, Brady Zhang, Yishun Lu

Categories: cs.CV, cs.AI

Published: 2026-05-15


💡 One-Sentence Summary

Proposes VLA-AD to make VLA policy distillation more efficient

🎯 Matched Areas: Pillar 1: Robot Control; Pillar 2: RL & Architecture; Pillar 9: Embodied Foundation Models

Keywords: vision-language-action, policy distillation, offline semantic guidance, robotic manipulation, model compression

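📋 Key Points

  1. Existing VLA policies are large and costly at inference time, which limits their use in real-time closed-loop control.
  2. VLA-AD uses a Vision-Language Model (VLM) as an offline semantic supervisor to improve the efficiency and robustness of policy distillation.
  3. On three LIBERO benchmark suites, the student nearly matches its OpenVLA-7B teacher while running 3.28× faster at inference.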

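📝 Abstract (Summary)

Billion-parameter Vision-Language-Action (VLA) policies have recently performed well in robotic manipulation, but their size and inference cost remain major obstacles to real-time closed-loop control. This paper proposes VLA-AD, a distillation framework that uses a Vision-Language Model (VLM) as an offline semantic supervisor to transfer a large VLA teacher into a lightweight student policy. VLA-AD augments the teacher-provided 7-DoF action targets with high-level semantic guidance, in the form of auxiliary signals such as task phase anchors and multi-frame operating-direction descriptions, which are used only during training. With OpenVLA-7B as the teacher, VLA-AD yields a 158M-parameter student, a 44× reduction in model size, with an average relative gap to the teacher of only 0.27%.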

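🔬 Method Details

Problem definition: The paper targets the efficiency and cost of existing VLA policies in real-time control. Existing distillation methods rely mainly on low-level action imitation and lack high-level semantic guidance, which hurts the distilled model's performance at inference time.

Core idea: VLA-AD uses a Vision-Language Model (VLM) as an offline semantic supervisor. The VLM's high-level semantic information augments the action targets provided by the teacher and improves what the student learns.

Technical framework: VLA-AD consists of three main modules: the teacher model, the student model, and the VLM. During training, the student receives 7-DoF action targets from the teacher and high-level semantic guidance from the VLM. At test time, the student runs independently, with neither the teacher nor the VLM required.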
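
The train-time versus test-time split described above can be sketched as follows. This is a minimal toy illustration, not the paper's architecture: `ToyStudent`, `StudentOutput`, the stand-in feature encoder, and the head sizes (`n_phases`, `n_directions`) are all hypothetical names and values chosen for the example. It only shows the pattern in which auxiliary semantic heads are evaluated during training and skipped entirely at deployment.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

ACTION_DIM = 7  # 7-DoF action target, as stated in the summary


@dataclass
class StudentOutput:
    action: List[float]
    # Auxiliary predictions exist only in training mode.
    phase_logits: Optional[List[float]] = None
    direction_logits: Optional[List[float]] = None


class ToyStudent:
    """Hypothetical student: one shared feature, an action head, two auxiliary heads."""

    def __init__(self, n_phases: int = 4, n_directions: int = 6) -> None:
        self.n_phases = n_phases          # assumed number of task phases
        self.n_directions = n_directions  # assumed number of direction classes

    def _feature(self, obs: Sequence[float]) -> float:
        # Stand-in for a real visual/language encoder.
        return sum(obs) / len(obs)

    def forward(self, obs: Sequence[float], training: bool = False) -> StudentOutput:
        h = self._feature(obs)
        action = [h * (i + 1) / ACTION_DIM for i in range(ACTION_DIM)]
        if not training:
            # Deployment path: no teacher, no VLM, no auxiliary heads.
            return StudentOutput(action=action)
        # Training path: also emit predictions that VLM labels can supervise.
        phase_logits = [h - k for k in range(self.n_phases)]
        direction_logits = [h + 0.5 * k for k in range(self.n_directions)]
        return StudentOutput(action, phase_logits, direction_logits)


student = ToyStudent()
obs = [0.2, 0.4, 0.6]
train_out = student.forward(obs, training=True)
test_out = student.forward(obs)
print(len(train_out.action), len(train_out.phase_logits), test_out.phase_logits)
```

Because the action head does not depend on the auxiliary heads, the deployed policy produces the same action whether or not those heads are computed, which is what allows the VLM to be dropped at test time.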

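Key innovation: VLA-AD introduces high-level semantic guidance, specifically task phase anchors and multi-frame operating-direction descriptions, which makes the student more robust to noisy teacher actions.

Key design: VLA-AD uses a dedicated loss function to balance low-level action imitation against high-level semantic guidance, and adapts the network architecture to meet the lightweight requirement.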
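
One plausible way to balance action imitation against semantic guidance is a weighted sum of an action regression term and two classification terms. The summary does not give the actual loss, so everything here is an assumption made for illustration: the mean-squared-error action term, the cross-entropy semantic terms, the function names, and the weights `w_phase` and `w_direction`.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and teacher 7-DoF actions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def distillation_loss(pred_action, teacher_action,
                      phase_logits, phase_label,
                      direction_logits, direction_label,
                      w_phase: float = 0.5, w_direction: float = 0.5) -> float:
    """Hypothetical combined loss: action imitation plus weighted semantic terms."""
    action_term = mse(pred_action, teacher_action)
    semantic_term = (w_phase * cross_entropy(phase_logits, phase_label)
                     + w_direction * cross_entropy(direction_logits, direction_label))
    return action_term + semantic_term


teacher_action = [0.1, 0.0, -0.1, 0.0, 0.0, 0.0, 1.0]
loss = distillation_loss(
    pred_action=teacher_action,           # perfect imitation: action term is 0
    teacher_action=teacher_action,
    phase_logits=[0.0, 0.0, 0.0, 0.0],    # uninformed guess over 4 phases
    phase_label=2,
    direction_logits=[0.0, 0.0, 0.0, 0.0],
    direction_label=1,
)
print(loss)
```

In this toy call the student copies the teacher action exactly, so the remaining loss comes only from the semantic terms. A student that imitates actions well but has no notion of task phase or motion direction is still penalized, which is the intuition behind adding the semantic signal.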

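📊 Experimental Highlights

With OpenVLA-7B as the teacher, VLA-AD produces a 158M-parameter student: a 44× reduction in model size and a 3.28× inference speedup (12.5 Hz on an RTX 4090), with an average relative gap to the teacher of only 0.27%.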
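
The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes a nominal 7e9 parameters for the teacher, taken from the name "OpenVLA-7B" rather than from an exact count, and uses the 12.5 Hz student control rate reported in the original abstract. The teacher latency and rate are derived values, not figures reported by the paper.

```python
TEACHER_PARAMS = 7e9      # assumption: nominal count implied by the name "OpenVLA-7B"
STUDENT_PARAMS = 158e6    # reported student size
STUDENT_RATE_HZ = 12.5    # reported student control rate on an RTX 4090
SPEEDUP = 3.28            # reported inference speedup over the teacher

size_ratio = TEACHER_PARAMS / STUDENT_PARAMS
student_latency_ms = 1000.0 / STUDENT_RATE_HZ
teacher_latency_ms = student_latency_ms * SPEEDUP   # derived, not reported
teacher_rate_hz = STUDENT_RATE_HZ / SPEEDUP         # derived, not reported

print(f"size ratio:       {size_ratio:.1f}x")             # 44.3x
print(f"student latency:  {student_latency_ms:.1f} ms")   # 80.0 ms
print(f"teacher latency:  {teacher_latency_ms:.1f} ms")   # 262.4 ms
print(f"teacher rate:     {teacher_rate_hz:.2f} Hz")      # 3.81 Hz
```

Under these assumptions the size ratio is consistent with the reported 44× reduction, and the speedup implies a teacher running at roughly 3.8 Hz, which illustrates why the teacher is hard to use for real-time closed-loop control.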

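🎯 Application Scenarios

Potential application areas include robotic manipulation, intelligent manufacturing, and human-robot interaction. By improving the efficiency and robustness of VLA policies, VLA-AD could enable faster response times and higher manipulation accuracy in practice, supporting broader use of intelligent robots in complex environments.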

📄 Abstract (Original)

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a 44× reduction in model size while matching the teacher with only a 0.27% average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a 3.28× inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different π0.5-4B teacher, where the student outperforms the teacher on two suites and remains within 0.53% on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.