Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Authors: Jin Shi, Brady Zhang, Yishun Lu

Categories: cs.CV, cs.AI

Published: 2026-05-15

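💡 One-Sentence Takeaway

Proposes VLA-AD, a distillation framework that uses offline semantic guidance from a vision-language model to make VLA policy distillation more efficient.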

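🎯 Matched areas: Pillar 1: Robot Control; Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: vision-language-action, policy distillation, offline semantic guidance, robotic manipulation, model compression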

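📋 Key Points

Existing VLA policy models are large and expensive at inference time, which limits their use in real-time control.
The VLA-AD framework uses a vision-language model (VLM) as an offline semantic supervisor, improving the efficiency and robustness of policy distillation.
On three LIBERO benchmark suites, the distilled student closely matches its OpenVLA-7B teacher while running inference 3.28× faster.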

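📝 Abstract (Summary)

Billion-parameter Vision-Language-Action (VLA) policies have recently performed well in robotic manipulation, but their model size and inference cost remain major obstacles to real-time closed-loop control. This paper proposes VLA-AD, a distillation framework that uses a Vision-Language Model (VLM) as an offline semantic supervisor to distill a large VLA teacher into a lightweight student policy. VLA-AD augments the teacher-provided 7-DoF action targets with high-level semantic guidance, using auxiliary signals such as task phase anchors and multi-frame operating-direction descriptions that are needed only during training. With OpenVLA-7B as the teacher, VLA-AD produces a 158M-parameter student, a 44× reduction in model size, with an average relative gap to the teacher of only 0.27%.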

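🔬 Method Details

Problem definition: The paper targets the efficiency and cost of deploying existing VLA policy models in real-time control. Existing distillation approaches rely mainly on low-level action imitation and lack high-level semantic guidance, which can leave the student sensitive to noise in the teacher's actions.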

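Core idea: VLA-AD uses a vision-language model (VLM) as an offline semantic supervisor. The VLM's high-level semantic information augments the action targets provided by the teacher, giving the student a richer training signal than action imitation alone.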

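Framework: The overall architecture has three main components: the teacher model, the student model, and the VLM. During training, the student learns from the teacher's action targets together with high-level semantic guidance from the VLM. At test time, the student runs independently, with neither the teacher nor the VLM required.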
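The train-time versus test-time split described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the record layout, the field names, the stub teacher and VLM functions, and the phase labels ("approach", "grasp") are all hypothetical. The only details taken from the source are that teacher targets are 7-DoF, that phase anchors and multi-frame direction descriptions are produced offline, and that the student needs neither the teacher nor the VLM at test time.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Action = List[float]  # 7-DoF target; the source fixes the dimension, not the layout


@dataclass
class DistillSample:
    """One cached training record. All field names are hypothetical."""
    frames: List[str]        # stand-in for a short window of image frames
    instruction: str
    teacher_action: Action   # low-level target from the VLA teacher
    phase_anchor: str        # task-phase label from the VLM (training only)
    direction_text: str      # multi-frame direction description (training only)


def build_offline_dataset(
    episodes: Sequence[dict],
    teacher: Callable[[str, str], Action],
    vlm_phase: Callable[[List[str], str], str],
    vlm_direction: Callable[[List[str], str], str],
    window: int = 3,
) -> List[DistillSample]:
    """Query the teacher and the VLM once, offline, and cache their outputs."""
    samples = []
    for ep in episodes:
        frames, instr = ep["frames"], ep["instruction"]
        for t in range(len(frames)):
            clip = frames[max(0, t - window + 1): t + 1]  # multi-frame context
            samples.append(DistillSample(
                frames=clip,
                instruction=instr,
                teacher_action=teacher(frames[t], instr),
                phase_anchor=vlm_phase(clip, instr),
                direction_text=vlm_direction(clip, instr),
            ))
    return samples


class StudentPolicy:
    """Placeholder student: at test time it sees only the frame and instruction."""

    def act(self, frame: str, instruction: str) -> Action:
        return [0.0] * 7  # a trained lightweight network would go here


# Stub teacher and VLM so the sketch runs end to end (hypothetical outputs).
def stub_teacher(frame, instr):
    return [0.1] * 6 + [1.0]


def stub_phase(clip, instr):
    return "approach" if len(clip) < 3 else "grasp"


def stub_direction(clip, instr):
    return "moving toward the target"


episodes = [{"frames": ["f0", "f1", "f2", "f3"], "instruction": "pick up the bowl"}]
dataset = build_offline_dataset(episodes, stub_teacher, stub_phase, stub_direction)
action = StudentPolicy().act("f0", "pick up the bowl")  # no teacher, no VLM
```

Caching the teacher and VLM outputs ahead of time is what keeps the two large models out of the deployment path: the student's `act` method takes no reference to either.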

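Key innovation: The main contribution is the high-level semantic guidance itself, in particular task phase anchors and multi-frame operating-direction descriptions, which make the student more robust to noisy teacher actions.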
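To make "noisy teacher actions" concrete, the sketch below counts gripper state changes in a teacher trajectory and flags states that are held only briefly, the kind of erroneous high-frequency gripper change the original abstract mentions. This is a hypothetical diagnostic, not part of VLA-AD: the function names, the 0.5 threshold, the `min_hold` parameter, and the convention that the gripper command lies in [0, 1] are all assumptions.

```python
from typing import List, Sequence


def gripper_states(gripper: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Binarize a gripper command sequence (threshold is an assumption)."""
    return [g > threshold for g in gripper]


def count_gripper_flips(gripper: Sequence[float], threshold: float = 0.5) -> int:
    """Number of open/close transitions along the trajectory."""
    states = gripper_states(gripper, threshold)
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


def short_lived_runs(gripper: Sequence[float], threshold: float = 0.5,
                     min_hold: int = 2) -> int:
    """Count interior runs of one gripper state held for fewer than min_hold steps.

    The first and last runs are skipped because the trajectory may simply
    start or end in the middle of a legitimate hold.
    """
    runs = []  # list of [state, length]
    for s in gripper_states(gripper, threshold):
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return sum(1 for _, length in runs[1:-1] if length < min_hold)


# Toy trajectories (made-up values, not data from the paper).
clean = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
noisy = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]

clean_flips, clean_short = count_gripper_flips(clean), short_lived_runs(clean)
noisy_flips, noisy_short = count_gripper_flips(noisy), short_lived_runs(noisy)
```

A student trained purely by action imitation would be asked to reproduce the single-step flip in `noisy`, whereas a phase-level label spanning those steps stays constant across it.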

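Key design: VLA-AD uses a loss function that balances low-level action imitation against high-level semantic guidance, together with a network architecture adapted to the lightweight student.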
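One way such a balanced objective could look is sketched below. The source does not give the form of the loss, so everything here is an assumption: mean squared error for action imitation, cross-entropy over a discrete set of phase labels, cosine distance between a predicted and a target direction embedding, and the weights `w_phase` and `w_dir`. It is written in plain Python so that it runs without a deep learning framework.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and teacher actions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, guarded against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(na * nb, 1e-12)


def distillation_loss(pred_action, teacher_action,
                      phase_logits, phase_label,
                      dir_pred, dir_target,
                      w_phase: float = 0.5, w_dir: float = 0.5) -> float:
    """Action imitation plus weighted semantic terms (hypothetical form)."""
    return (mse(pred_action, teacher_action)
            + w_phase * cross_entropy(phase_logits, phase_label)
            + w_dir * cosine_distance(dir_pred, dir_target))


# Toy inputs: perfect action match, uninformative phase logits, aligned direction.
teacher_action = [0.1, 0.0, -0.2, 0.0, 0.0, 0.3, 1.0]
loss = distillation_loss(
    pred_action=teacher_action, teacher_action=teacher_action,
    phase_logits=[0.0, 0.0, 0.0], phase_label=1,
    dir_pred=[1.0, 2.0], dir_target=[2.0, 4.0],
)
```

Setting both weights to zero reduces the objective to plain action imitation, which gives a natural baseline for checking what the semantic terms contribute.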

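📊 Experimental Highlights

With OpenVLA-7B as the teacher, VLA-AD produces a 158M-parameter student that is 44× smaller and runs inference 3.28× faster, while its average relative gap to the teacher is only 0.27%.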
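The headline numbers can be sanity-checked with a few lines of arithmetic. The sketch below takes the nominal "7B" at face value, uses the 12.5 Hz student control rate quoted in the original abstract, and assumes that the relative gap is defined as (teacher - student) / teacher, since the source does not spell out the definition. The values passed to `relative_gap` are toy inputs, not results from the paper.

```python
# Model-size reduction, using the nominal parameter counts from the source.
teacher_params = 7e9      # "7B" taken at face value (an assumption)
student_params = 158e6
compression = teacher_params / student_params  # roughly 44x

# Control rate implied for the teacher by the reported speedup.
student_hz = 12.5         # reported on an RTX 4090
speedup = 3.28
teacher_hz = student_hz / speedup

# Per-step latency budget at each rate, in milliseconds.
student_latency_ms = 1000.0 / student_hz
teacher_latency_ms = 1000.0 / teacher_hz


def relative_gap(teacher_score: float, student_score: float) -> float:
    """Relative gap in percent; positive when the student trails the teacher.

    The formula is an assumed definition, not one given in the source.
    """
    return 100.0 * (teacher_score - student_score) / teacher_score


toy_gap = relative_gap(100.0, 99.0)  # toy scores for illustration only

print(f"compression: {compression:.1f}x")
print(f"teacher rate: {teacher_hz:.2f} Hz")
print(f"latency: student {student_latency_ms:.0f} ms, teacher {teacher_latency_ms:.0f} ms")
```

Under these assumptions the reported speedup corresponds to moving from under 4 Hz to 12.5 Hz, which is the difference that matters for closed-loop control.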

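🎯 Application Scenarios

Potential application areas include robotic manipulation, smart manufacturing, and human-robot interaction. By improving the efficiency and robustness of VLA policies, VLA-AD could enable faster response times and more precise operation in practice, supporting wider deployment of intelligent robots in complex environments.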

📄 Abstract (Original)

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce \textbf{VLA-AD}, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a $44\times$ reduction in model size while matching the teacher with only a $0.27\%$ average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a $3.28\times$ inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different $π_{0.5}$-4B teacher, where the student outperforms the teacher on two suites and remains within $0.53\%$ on \texttt{libero_goal}. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
