Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

📄 arXiv: 2502.19645v2 📥 PDF

Authors: Moo Jin Kim, Chelsea Finn, Percy Liang

Categories: cs.RO, cs.AI, cs.CV, cs.LG

Published: 2025-02-27 (Updated: 2025-04-28)

Note: Accepted to Robotics: Science and Systems (RSS) 2025. Project website: https://openvla-oft.github.io/

🔗 Code/Project: PROJECT_PAGE


💡 One-Sentence Takeaway

Proposes an optimized fine-tuning recipe that improves both the inference speed and task success rate of vision-language-action models.

🎯 Matched Areas: Pillar 1: Robot Control | Pillar 2: RL Algorithms & Architecture | Pillar 9: Embodied Foundation Models

Keywords: vision-language-action, fine-tuning recipe, robot control, multimodal learning, performance optimization

📋 Core Points

  1. Existing vision-language-action models perform poorly on novel robot setups, and the most effective fine-tuning strategy has been unclear.
  2. This paper proposes an Optimized Fine-Tuning (OFT) recipe that combines several design choices to improve inference efficiency and input-output flexibility.
  3. Experiments show that OpenVLA-OFT significantly improves success rates across multiple task suites and outperforms other models in real-world evaluations.

📝 Abstract (Summary)

Recent vision-language-action (VLA) models perform well at task execution, language following, and semantic generalization, but still require fine-tuning for novel robot setups. This paper studies key design choices in VLA fine-tuning, including action decoding schemes, action representations, and learning objectives. Through an empirical analysis of the OpenVLA model, it proposes an Optimized Fine-Tuning (OFT) recipe that combines parallel decoding, action chunking, a continuous action representation, and a simple L1 regression learning objective, substantially improving inference efficiency and policy performance. OpenVLA-OFT sets a new state of the art on the LIBERO benchmark, raising the average success rate from 76.5% to 97.1%, and outperforms other VLA models in real-world evaluations.

🔬 Method Details

Problem definition: How to effectively fine-tune vision-language-action models for novel robot setups, where existing adaptation approaches fall short in both performance and efficiency.

Core idea: An Optimized Fine-Tuning (OFT) recipe that combines parallel decoding, action chunking, and a continuous action representation to improve the model's inference efficiency and flexibility.

Technical framework: The recipe integrates several components: parallel decoding, which generates all action dimensions in a single forward pass instead of autoregressively; action chunking, which predicts a sequence of future actions at once; and a simple L1 regression learning objective over continuous actions, which simplifies training.

Key innovation: Rather than a single new mechanism, the main contribution is a fine-tuning recipe that integrates these design choices, yielding significant gains in both inference speed and task success rate.

Key design: Actions use a continuous representation trained with a simple L1 loss instead of discretized tokens, while parallel decoding and action chunking restructure how the model emits actions.
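To make the recipe concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of an OFT-style action head: a single linear map from a pooled VLM feature to a whole chunk of continuous actions (parallel decoding + action chunking), trained with the L1 regression objective. All dimensions (`HIDDEN_DIM`, `CHUNK_LEN`, `ACTION_DIM`) are assumed for illustration.

```python
import numpy as np

HIDDEN_DIM = 512   # pooled VLM feature size (hypothetical)
CHUNK_LEN = 8      # number of future actions predicted per forward pass
ACTION_DIM = 7     # e.g. 6-DoF end-effector delta + gripper (assumed)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(HIDDEN_DIM, CHUNK_LEN * ACTION_DIM))
b = np.zeros(CHUNK_LEN * ACTION_DIM)

def predict_chunk(h):
    """Parallel decoding: one matrix multiply emits the whole action chunk,
    rather than generating one discrete action token at a time."""
    return (h @ W + b).reshape(CHUNK_LEN, ACTION_DIM)

def l1_loss(pred, target):
    """Simple L1 regression objective over continuous actions."""
    return np.abs(pred - target).mean()

h = rng.normal(size=HIDDEN_DIM)                       # stand-in VLM feature
target = rng.uniform(-1, 1, size=(CHUNK_LEN, ACTION_DIM))
pred = predict_chunk(h)
print(pred.shape)   # (8, 7): a full chunk from one forward pass
```

Because the loss is a plain regression over continuous values, training avoids the action-discretization step used by token-based VLA decoding.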

🖼️ Key Figures

fig_0
fig_1
fig_2

📊 Experimental Highlights

OpenVLA-OFT raises the average success rate on the LIBERO benchmark from 76.5% to 97.1% while increasing action generation throughput by 26x. In real-world evaluations on dexterous, high-frequency control tasks, it outperforms other VLA models fine-tuned with their default recipes as well as strong imitation learning policies trained from scratch, by up to 15% (absolute) in average success rate.
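A back-of-the-envelope calculation (illustrative numbers, not measurements from the paper) shows where a large throughput gain can come from: autoregressive decoding emits one action token per forward pass, while parallel decoding with chunking emits an entire chunk in a single pass.

```python
# Hypothetical decoding cost comparison; CHUNK_LEN and ACTION_DIM are assumed.
CHUNK_LEN = 8    # actions per chunk
ACTION_DIM = 7   # tokens per action under a discretized, autoregressive scheme

autoregressive_passes = CHUNK_LEN * ACTION_DIM  # sequential forward passes
parallel_passes = 1                             # one pass per whole chunk

reduction = autoregressive_passes / parallel_passes
print(autoregressive_passes, reduction)  # 56 56.0
```

The measured 26x wall-clock speedup reported in the paper is smaller than this raw forward-pass reduction, since per-pass cost and other overheads also change.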

🎯 Application Scenarios

Potential applications include robot control, automated task execution, and human-robot interaction. By improving the adaptability and performance of vision-language-action models, the recipe could enable more efficient robot operation in complex environments and advance embodied intelligence.

📄 Abstract (Original)

Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ($π_0$ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.