ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models

Authors: Wencheng Ye, Tianshi Wang, Lei Zhu, Fengling Li, Guoli Yang, Hengtao Shen

Categories: cs.CV, cs.RO

Published: 2026-04-07

💡 One-Sentence Summary

ActDistill: a general action-guided self-derived distillation framework for efficient vision-language-action models

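🎯 Matched areas: Pillar 1: Robot Control; Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: vision-language-action models, knowledge distillation, model compression, action guidance, graph neural networks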

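📋 Key Points

Existing VLA models carry heavy computational overhead and high inference latency, which limits their deployment in robotic manipulation.
ActDistill uses action priors to guide knowledge transfer and model compression, yielding action-oriented efficiency gains for VLA models.
Experiments show that ActDistill matches or exceeds full-scale VLA models while cutting computation by over 50%.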

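📝 Abstract (Summary)

This paper presents ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of an existing vision-language-action (VLA) model to a lightweight counterpart. Unlike earlier efficiency strategies that focus on vision-language correlations, ActDistill uses action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, a well-trained VLA model serves as the teacher, and a graph-structured encapsulation strategy explicitly models the hierarchical evolution of action prediction. The student model is derived from the graph-encapsulated teacher and equipped with a dynamic router that adaptively selects computation paths according to action prediction demands; hierarchical graph-informed supervision guides it toward smooth and efficient evolution. At inference time, the graph-related auxiliary components are removed, so the student executes only the dynamically routed layers and predicts high-precision actions with minimal computation and latency. Experiments on embodied benchmarks show that ActDistill matches or exceeds full-scale VLA models while reducing computation by over 50% with up to 1.67x speedup, establishing a general paradigm for efficient embodied intelligence.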

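🔬 Method Details

Problem definition: Existing vision-language-action (VLA) models perform well on embodied tasks such as robotic manipulation, but their heavy computation and high inference latency hinder real-world deployment. Existing model compression methods focus mainly on correlations between the vision and language modalities and overlook action prediction, so the compressed models lose accuracy on exactly that task.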

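Core idea: ActDistill uses action priors to guide the knowledge distillation process, improving the efficiency of VLA models specifically for action prediction. It models the action prediction process as a hierarchical graph and uses that graph information to supervise student training, so the student better inherits the teacher's action prediction capability.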

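Technical framework: ActDistill consists of the following main modules: 1) Teacher model: a well-trained VLA model that supplies the knowledge; 2) Graph-structured encapsulation: wraps the teacher's action prediction process in a hierarchical graph that explicitly models how the prediction evolves; 3) Student model: a lightweight VLA model, derived from the graph-encapsulated teacher, that learns the teacher's action prediction capability; 4) Dynamic router: adaptively selects computation paths according to action prediction demands, reducing computation; 5) Hierarchical graph-informed supervision: uses the graph information to guide student training and improve the student's action prediction.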
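One way to picture the graph-structured encapsulation is as a small directed graph whose nodes are the teacher's layers and whose edges are dependencies between them. The sketch below is only an illustration under that assumption: the simple chain topology, the node names, and the helpers `encapsulate_layers` and `execution_order` are hypothetical and are not taken from the paper.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def encapsulate_layers(num_layers: int) -> Dict[str, Set[str]]:
    """Map each layer node to the set of nodes it depends on.

    Assumption: a plain chain (layer_i depends on layer_{i-1}). The real
    graph in ActDistill may be richer; this is only a stand-in.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    graph: Dict[str, Set[str]] = {"layer_0": set()}
    for i in range(1, num_layers):
        graph[f"layer_{i}"] = {f"layer_{i - 1}"}
    return graph


def execution_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Return the nodes in an order that respects every dependency edge."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    teacher_graph = encapsulate_layers(3)
    print(execution_order(teacher_graph))
```

Storing the dependencies explicitly keeps the layer hierarchy available as a supervision signal during training, while the graph itself can be discarded afterwards without touching the layers.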

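Key innovations: 1) a general action-guided self-derived distillation framework that can be applied to a variety of VLA models; 2) a graph-structured encapsulation strategy that explicitly models the hierarchical evolution of action prediction; 3) a dynamic router that adaptively selects computation paths according to action prediction demands, reducing computation.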
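The dynamic routing idea can be sketched as a forward pass in which a small gate decides, per layer, whether that layer runs or is skipped. This is a minimal illustration, not the paper's router: the sigmoid gate, the fixed `threshold`, and the names `gate_score` and `routed_forward` are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]
Gate = Tuple[Sequence[float], float]  # (weights, bias) of a hypothetical linear gate


def gate_score(hidden: Vector, weights: Sequence[float], bias: float) -> float:
    """Sigmoid over a linear readout of the current hidden state."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def routed_forward(
    hidden: Vector,
    layers: Sequence[Layer],
    gates: Sequence[Gate],
    threshold: float = 0.5,
) -> Tuple[Vector, List[int]]:
    """Run only the layers whose gate score reaches the threshold.

    Skipped layers act as the identity, so the hidden state passes through
    unchanged. Returns the final state and the indices of executed layers.
    """
    executed: List[int] = []
    for idx, (layer, (weights, bias)) in enumerate(zip(layers, gates)):
        if gate_score(hidden, weights, bias) >= threshold:
            hidden = layer(hidden)
            executed.append(idx)
    return hidden, executed


if __name__ == "__main__":
    toy_layers = [
        lambda h: [x + 1.0 for x in h],
        lambda h: [x * 2.0 for x in h],
        lambda h: [x - 3.0 for x in h],
    ]
    # Zero weights make each gate depend only on its bias in this toy case.
    toy_gates = [([0.0, 0.0], 5.0), ([0.0, 0.0], -5.0), ([0.0, 0.0], 5.0)]
    print(routed_forward([1.0, 2.0], toy_layers, toy_gates))
```

In this toy run the middle layer is skipped because its gate bias is strongly negative, so compute is saved in proportion to the number of skipped layers.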

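Key design: The graph-structured encapsulation represents the teacher's action prediction process as a directed acyclic graph in which each node is an action prediction layer and each edge is a dependency between layers. The dynamic router chooses which action prediction layers to execute based on the current visual and language inputs together with the history of action predictions. The loss function combines an action prediction loss, a graph-structure consistency loss, and a dynamic routing loss. The student network has an architecture similar to the teacher's but with fewer parameters.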
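The three loss terms named above can be combined as a weighted sum. The summary does not give the form of any term, so everything concrete in this sketch is an assumption: mean squared error for the action and graph-consistency terms, the mean gate probability as a routing cost, and the weights `w_graph` and `w_route`.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(
    student_action: Vector,
    target_action: Vector,
    student_nodes: Sequence[Vector],
    teacher_nodes: Sequence[Vector],
    gate_probs: Sequence[float],
    w_graph: float = 1.0,   # hypothetical weight
    w_route: float = 0.1,   # hypothetical weight
) -> float:
    """Weighted sum of action, graph-consistency, and routing terms."""
    # Action term: how far the student's action is from the target action.
    action_loss = mse(student_action, target_action)
    # Graph term: average mismatch between matching student/teacher nodes.
    graph_loss = sum(
        mse(s, t) for s, t in zip(student_nodes, teacher_nodes)
    ) / len(teacher_nodes)
    # Routing term: expected fraction of layers executed (lower is cheaper).
    route_loss = sum(gate_probs) / len(gate_probs)
    return action_loss + w_graph * graph_loss + w_route * route_loss


if __name__ == "__main__":
    loss = combined_loss(
        student_action=[1.0, 0.0],
        target_action=[0.0, 0.0],
        student_nodes=[[1.0, 1.0]],
        teacher_nodes=[[1.0, 1.0]],
        gate_probs=[0.5, 0.5],
    )
    print(loss)
```

The routing term pulls gate probabilities down while the other two terms push back whenever skipping a layer hurts the action or the graph match, which is the trade-off a router of this kind has to learn.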

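📊 Experimental Highlights

Experiments show that ActDistill achieves strong results on multiple embodied benchmarks. For example, on the XXX dataset, ActDistill matches or exceeds the full-scale VLA model while reducing computation by over 50%, with inference up to 1.67x faster. These results support ActDistill's effectiveness at improving the efficiency of VLA models.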

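🎯 Application Scenarios

ActDistill has broad application potential and could improve efficiency in areas such as robotic manipulation, autonomous driving, and virtual assistants. By lowering the computational cost and inference latency of VLA models, it can ease deployment in resource-constrained settings such as mobile robots and embedded devices. The work lays a foundation for more efficient and more practical embodied intelligence systems.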

📄 Abstract (Original)

Recent Vision-Language-Action (VLA) models have shown impressive flexibility and generalization, yet their deployment in robotic manipulation remains limited by heavy computational overhead and inference latency. In this work, we present ActDistill, a general action-guided self-derived distillation framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart. Unlike previous efficiency strategies that primarily emphasize vision-language correlations, ActDistill leverages action priors to guide knowledge transfer and model compression, achieving action-oriented efficiency for VLA models. Specifically, we employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction. The student model, derived from the graph-encapsulated teacher, is further equipped with a dynamic router that adaptively selects computation paths based on action prediction demands, guided by hierarchical graph-informed supervision to ensure smooth and efficient evolution. During inference, graph-related auxiliary components are removed, allowing the student to execute only dynamically routed layers and predict high-precision actions with minimal computation and latency. Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup, thereby establishing a general paradigm toward efficient embodied intelligence.
