CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

Authors: Wenxuan Song, Han Zhao, Fuhao Li, Ziyang Zhou, Xi Wang, Jing Lyu, Pengxiang Ding, Yan Wang, Donglin Wang, Haoang Li

Categories: cs.CV, cs.RO

Published: 2026-05-11

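💡 One-Sentence Summary

Proposes CapVector, which decouples training objectives in parameter space to enhance the capabilities of vision-language-action models at low cost.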

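🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: vision-language-action models, parameter-space decoupling, embodied intelligence, model finetuning, zero-shot generalization, capability vectors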

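📋 Key Points

Finetuning existing VLA models runs into a performance ceiling, and adding auxiliary objectives to finetuning incurs substantial extra computational overhead.
CapVector decouples general capability from task-distribution fitting in parameter space, converting the gains of auxiliary training into lightweight vectors.
Experiments show the method is effective across multiple models and generalizes zero-shot across environments and embodiments.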

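📝 Abstract (translated from Chinese)

Pretrained vision-language-action (VLA) models gain only limited performance from standard supervised finetuning (SFT), and adapting them is costly. This paper proposes CapVector to address both problems. Existing advanced finetuning methods that use auxiliary training objectives improve performance and speed up convergence, but they usually come with significant additional computational overhead. To combine the performance benefits of auxiliary training with the simplicity of standard SFT, the paper decouples two objectives in parameter space: enhancing general capabilities and fitting task-specific action distributions. Two models are trained on a small-scale task set with two different training strategies, and the difference between their parameters is extracted as "capability vectors". Merging these vectors with the pretrained parameters, and adding a lightweight orthogonal regularization loss during standard SFT, reaches the performance of auxiliary-finetuned baselines at lower computational cost. Experiments show that CapVector applies across diverse models and generalizes zero-shot to unseen environments and embodiments.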

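🔬 Method Details

Problem definition: When finetuning VLA models, plain SFT struggles to fully exploit the potential of the pretrained model. Introducing auxiliary tasks (for example, auxiliary loss functions) improves performance but significantly increases computational complexity and resource consumption during training.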

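Core idea: The paper decouples "improving general capability" from "fitting the specific task". It compares the parameters of models obtained with two different training strategies (one geared toward general capability, the other toward task fitting) and extracts a vector that represents the "capability gain", so that the capability can be transferred as a module.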

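Technical framework: First, the model is finetuned on a small-scale task set with each of the two strategies, yielding two models. The difference between their parameters is taken as the CapVector. In subsequent standard SFT, this vector is merged with the pretrained weights, and an orthogonal regularization constraint is added to preserve fitting accuracy on the specific task.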
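
The extraction and merge steps described above can be sketched as simple parameter arithmetic. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are modeled as plain dictionaries of float lists in place of real model state dicts, the names `theta_pre`, `theta_sft`, `theta_aux`, and both helper functions are hypothetical, the subtraction order (auxiliary-objective model minus standard-SFT model) is inferred from the abstract, and the scaling factor `alpha` is an added assumption that the source does not mention.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Params = Dict[str, List[float]]


def extract_capability_vector(theta_aux: Params, theta_sft: Params) -> Params:
    """Per-parameter difference between the two finetuned models."""
    return {
        name: [a - s for a, s in zip(theta_aux[name], theta_sft[name])]
        for name in theta_aux
    }


def merge_capability_vector(
    theta_pre: Params, cap_vec: Params, alpha: float = 1.0
) -> Params:
    """Add the capability vector to the pretrained weights.

    alpha is an assumed scaling knob; alpha=1.0 is a plain sum.
    """
    return {
        name: [p + alpha * v for p, v in zip(theta_pre[name], cap_vec[name])]
        for name in theta_pre
    }


# Toy weights standing in for the pretrained model and the two models
# finetuned on the small-scale task set.
theta_pre = {"layer.weight": [0.0, 1.0, 2.0]}
theta_sft = {"layer.weight": [0.5, 1.0, 2.5]}  # standard SFT
theta_aux = {"layer.weight": [0.5, 2.0, 3.5]}  # SFT with auxiliary objectives

cap_vec = extract_capability_vector(theta_aux, theta_sft)
theta_meta = merge_capability_vector(theta_pre, cap_vec)
```

In this toy case the two finetuned models share the same task-fitting shift, so it cancels in the subtraction and only the part attributed to the auxiliary objective remains in `cap_vec`. With real models the same arithmetic would be applied tensor by tensor.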

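Key innovation: The method is the first to explicitly decouple capability enhancement from task fitting in parameter space. It turns complex auxiliary training objectives into additive "capability vectors", balancing performance gains against training efficiency.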

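Key design: The core is the mechanism for extracting and merging parameter differences, combined with a lightweight orthogonal regularization loss. Together they ensure that the merged model keeps the generality of the pretrained model while gaining strengthened capability for the specific task.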
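
The summary does not give the form of the orthogonal regularization loss. The sketch below shows one possible formulation, offered purely as an assumption: it penalizes the squared cosine similarity between the update that SFT applies to the merged model and the capability vector, so that task fitting is pushed toward directions orthogonal to the capability direction. The function name `orthogonal_penalty`, the flat-list representation, and the `eps` term are all hypothetical.

```python
from typing import List


def orthogonal_penalty(
    delta: List[float], cap_vec: List[float], eps: float = 1e-12
) -> float:
    """Squared cosine similarity between an SFT update and a capability vector.

    Assumed form, not taken from the paper: 0.0 when the two are orthogonal,
    close to 1.0 when they are parallel.
    """
    dot = sum(d * v for d, v in zip(delta, cap_vec))
    norm_sq = sum(d * d for d in delta) * sum(v * v for v in cap_vec)
    return dot * dot / (norm_sq + eps)  # eps guards against a zero update


cap_vec = [0.0, 1.0, 1.0]
orthogonal_update = [1.0, 1.0, -1.0]  # dot product with cap_vec is 0
aligned_update = [0.0, 2.0, 2.0]      # parallel to cap_vec

penalty_orth = orthogonal_penalty(orthogonal_update, cap_vec)
penalty_aligned = orthogonal_penalty(aligned_update, cap_vec)
```

Under this assumption the training objective would be the usual SFT loss plus a weighted penalty, with the weight being another hypothetical hyperparameter. The extra cost is a dot product and two norms per step, which fits the "lightweight" description.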

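📊 Experimental Highlights

Across several mainstream VLA architectures, CapVector clearly outperforms standard SFT and matches the performance of more complex auxiliary-objective finetuning methods. In cross-environment and cross-embodiment tests it generalizes strongly, indicating that capability vectors transfer well across tasks.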

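🎯 Application Scenarios

The method targets embodied robotics, especially deployments of VLA models that must adapt quickly to new environments or tasks. Because it is lightweight, it is valuable on edge devices with limited compute, can substantially lower finetuning costs for large robot fleets, and makes cross-platform transfer more efficient.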

📄 Abstract (Original)

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To deliver the goal, we only need to train the model to converge on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameters' difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, (2) can generalize to novel environments and embodiments out of the box.
