DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation

Authors: Zebin Yang, Yijiahao Qi, Tong Xie, Bo Yu, Shaoshan Liu, Meng Li

Category: cs.RO

Published: 2026-02-26

Comments: DAC 2026

🔗 Code/Project: https://github.com/PKU-SEC-Lab/DYSL_VLA

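💡 One-Sentence Takeaway

Proposes DySL-VLA to reduce the high computational cost of vision-language-action models in robot manipulation.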

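🎯 Matched Areas: Pillar 1: Robot Control; Pillar 2: RL & Architecture; Pillar 9: Embodied Foundation Models

Keywords: dynamic skipping, vision-language-action, robot manipulation, knowledge distillation, computational efficiency, real-time performance, multimodal fusion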

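📋 Key Points

Existing vision-language-action (VLA) models are too computationally expensive for real-time use, which limits their deployment in real-world applications.
DySL-VLA dynamically skips unimportant VLA layers, spending computation only where an action needs it.
Experiments report a 2.1% gain in success length over Deer-VLA on Calvin, an 85.7x reduction in trainable parameters, and a 3.75x speedup over the RoboFlamingo baseline at iso-accuracy.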

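📝 Abstract (Translated from Chinese)

Vision-Language-Action (VLA) models have achieved notable success in tasks such as robot manipulation by combining a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle to real-time applications. We observe that the actions within a task differ in importance: critical steps demand high precision, whereas less important steps can tolerate more variance. Based on this, we propose DySL-VLA, a framework that reduces computational cost by dynamically skipping VLA layers. DySL-VLA divides its layers into two types, informative layers and incremental layers, so that layers can be skipped without sacrificing accuracy. Experiments show that on the Calvin dataset DySL-VLA improves success length by 2.1% over Deer-VLA while reducing trainable parameters by a factor of 85.7, and achieves a 3.75x speedup over the RoboFlamingo baseline at iso-accuracy.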

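🔬 Method Details

Problem definition: The paper targets the high computational cost of Vision-Language-Action (VLA) models in robot manipulation. Existing methods struggle to meet real-time requirements, particularly on complex tasks.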

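Core idea: DySL-VLA dynamically skips VLA layers according to the importance of each action. Critical steps receive the full computation, while steps that tolerate more variance run with fewer layers, which lowers the overall computational load.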
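The effect of importance-based skipping on compute can be illustrated with a toy cost model. This is a minimal sketch, not the paper's method: the per-step `importance` scores, the `threshold`, and the layer counts are all hypothetical values chosen for illustration, and the paper decides when to skip with its own guidance mechanism rather than a fixed threshold.

```python
from typing import List


def choose_skips(importance: List[float], threshold: float) -> List[bool]:
    """Return True for steps where skippable layers may be skipped.

    Both `importance` and `threshold` are hypothetical stand-ins for
    whatever signal the real system uses to judge a step's importance.
    """
    return [score < threshold for score in importance]


def relative_cost(skips: List[bool], n_layers: int, n_skippable: int) -> float:
    """Fraction of layer executions kept, relative to running every layer."""
    if not skips:
        raise ValueError("skips must contain at least one step")
    full = n_layers * len(skips)
    used = sum(n_layers - n_skippable if skip else n_layers for skip in skips)
    return used / full


# Toy episode: approach, approach, grasp (critical), lift, place (critical).
importance = [0.2, 0.3, 0.9, 0.4, 0.8]
skips = choose_skips(importance, threshold=0.5)

# Assumed model shape: 24 layers, 16 of which are skippable.
cost = relative_cost(skips, n_layers=24, n_skippable=16)
print(skips, cost)
```

In this toy episode three of the five steps skip 16 of 24 layers, so 72 of 120 layer executions remain, which is 60% of the full cost.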

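Framework: DySL-VLA categorizes its layers into two types. Informative layers are always executed, while incremental layers can be selectively skipped depending on the importance of the action. A prior-post skipping guidance mechanism determines when to initiate layer-skipping.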
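The split between always-executed and skippable layers can be sketched as a forward pass over a layer stack. This is an illustrative sketch under stated assumptions: the function `forward_with_skipping`, the scalar "hidden state", the choice of which indices are informative, and the boolean `skip_incremental` flag are all hypothetical. In the paper the skip decision comes from the prior-post guidance mechanism, which is not modeled here.

```python
from typing import Callable, List, Sequence, Set, Tuple

Layer = Callable[[float], float]


def forward_with_skipping(
    x: float,
    layers: Sequence[Layer],
    informative: Set[int],
    skip_incremental: bool,
) -> Tuple[float, List[int]]:
    """Run informative layers always; run incremental layers unless skipping.

    Returns the final state and the indices of the layers that executed.
    """
    executed: List[int] = []
    for idx, layer in enumerate(layers):
        if idx in informative or not skip_incremental:
            x = layer(x)
            executed.append(idx)
    return x, executed


# Toy stack: each "layer" adds a residual update to a scalar state.
# Layers 0 and 3 make large updates (assumed informative); 1 and 2 small ones.
layers = [lambda h, d=d: h + d for d in (1.0, 0.1, 0.1, 1.0)]
informative = {0, 3}

full_out, full_ids = forward_with_skipping(0.0, layers, informative, skip_incremental=False)
fast_out, fast_ids = forward_with_skipping(0.0, layers, informative, skip_incremental=True)
print(full_ids, fast_ids)
```

With skipping enabled only layers 0 and 3 run, and the output moves from about 2.2 to 2.0. A small deviation of this kind is what less important steps are assumed to tolerate.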

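Key innovations: The main technical contributions are the dynamic layer-skipping mechanism and a skip-aware two-stage knowledge distillation algorithm. Together they let the model keep its accuracy while substantially reducing computation.

Key design: Training uses loss functions and a network design that support dynamic selection of skipped layers. Knowledge distillation converts a standard VLA model into a DySL-VLA, preserving accuracy while improving efficiency.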
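The general idea of distilling a full model into a layer-skipping one can be sketched with a plain output-matching loss. This is generic knowledge distillation, not the paper's skip-aware two-stage algorithm, whose stages and loss terms are not described in this summary. The function name and the action vectors below are hypothetical.

```python
from typing import Sequence


def distillation_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between student and teacher action vectors."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher outputs must have the same length")
    if not teacher:
        raise ValueError("outputs must not be empty")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


# Hypothetical actions for one observation.
teacher_action = [0.50, -0.20, 0.10]  # full model, all layers executed
student_action = [0.40, -0.20, 0.30]  # same input with layers skipped

loss = distillation_loss(student_action, teacher_action)
print(loss)
```

Minimizing a loss of this kind pushes the skipping model's actions toward those of the full model, so that skipped layers cost as little accuracy as possible.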

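📊 Experimental Highlights

On the Calvin dataset, DySL-VLA improves success length by 2.1% over Deer-VLA while reducing trainable parameters by a factor of 85.7. At iso-accuracy, it runs 3.75x faster than the RoboFlamingo baseline.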
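The reported factors are easier to read as fractions of the baseline. The short calculation below uses only the two factors quoted above. The helper name is hypothetical, and the result assumes each factor is a simple ratio against its comparison baseline.

```python
def remaining_fraction(factor: float) -> float:
    """Fraction of the baseline quantity left after an N-fold reduction."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / factor


latency_fraction = remaining_fraction(3.75)  # 3.75x speedup
param_fraction = remaining_fraction(85.7)    # 85.7x fewer trainable parameters

print(f"latency: {latency_fraction:.1%} of baseline")
print(f"trainable parameters: {param_fraction:.1%} of baseline")
```

A 3.75x speedup corresponds to roughly 26.7% of the baseline latency, and an 85.7x reduction leaves roughly 1.2% of the trainable parameters.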

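🎯 Application Scenarios

DySL-VLA is broadly applicable to robot manipulation, especially tasks that require real-time decisions and efficient execution, such as automated warehousing, smart homes, and industrial robots. Its lower computational cost at comparable accuracy helps robots adapt to complex environments and work more efficiently.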

📄 Abstract (Original)

Vision-Language-Action (VLA) models have shown remarkable success in robotic tasks like manipulation by fusing a language model's reasoning with a vision model's 3D understanding. However, their high computational cost remains a major obstacle for real-world applications that require real-time performance. We observe that the actions within a task have varying levels of importance: critical steps demand high precision, while less important ones can tolerate more variance. Leveraging this insight, we propose DySL-VLA, a novel framework that addresses computational cost by dynamically skipping VLA layers based on each action's importance. DySL-VLA categorizes its layers into two types: informative layers, which are consistently executed, and incremental layers, which can be selectively skipped. To intelligently skip layers without sacrificing accuracy, we invent a prior-post skipping guidance mechanism to determine when to initiate layer-skipping. We also propose a skip-aware two-stage knowledge distillation algorithm to efficiently train a standard VLA into a DySL-VLA. Our experiments indicate that DySL-VLA achieves 2.1% improvement in success length over Deer-VLA on the Calvin dataset, while simultaneously reducing trainable parameters by a factor of 85.7 and providing a 3.75x speedup relative to the RoboFlamingo baseline at iso-accuracy. Our code is available on https://github.com/PKU-SEC-Lab/DYSL_VLA.
