Primitive Subspaces Mediate Few-Shot Transfer in VLAs
Authors: Anya Singh, Cabrel Happi, Jai Relan, Varun Nair, Vidyut Baradwaj
Category: cs.RO
Published: 2026-05-29
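💡 One-Sentence Takeaway
Proposes primitive subspaces as the mechanism behind few-shot transfer in vision-language-action (VLA) policies.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: vision-language-action, few-shot learning, transfer learning, primitive-aware training, industrial automation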
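📋 Key Points
- Existing vision-language-action policies must be fine-tuned for each new task, so they cannot learn new tasks cheaply.
- The paper proposes primitive-aware training to build a transferable library of sub-skills that can be composed at inference time, from a handful of demonstrations, to perform new tasks.
- Experiments show that primitive-trained models are markedly more sample-efficient in few-shot transfer: they need 3 demonstrations where flat-trained models need 10 to reach the same level.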
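📝 Abstract (translated from Chinese)
Deploying vision-language-action (VLA) policies in industrial settings requires teaching new tasks at low cost, yet current VLA methods need fine-tuning for every new task. The paper asks whether primitive-aware training yields a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a few demonstrations, to perform tasks the policy was not trained on. The authors train two different VLA architectures on the REASSEMBLE dataset and compare them under controlled conditions. Primitive-trained models reach 78% of fine-tuned upper-bound performance with only 3 demonstrations, whereas flat-trained models need 10 demonstrations, a roughly 3x gap in sample efficiency.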
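🔬 Method Details
Problem definition: Current vision-language-action policies must be fine-tuned on every new task, which makes teaching new tasks costly and inefficient.
Core idea: Use primitive-aware training to build a transferable library of sub-skills, so that at inference time the model can compose skills from a few demonstrations and perform unseen tasks.
Technical framework: Two VLA architectures, OpenVLA and $π_{0.5}$, are trained on the REASSEMBLE dataset under matched LoRA fine-tuning recipes and locked hyperparameters. Training varies between flat trajectories and primitive-segmented episodes.
Key innovation: Primitive-trained models show substantially stronger few-shot transfer, and the paper verifies that primitive representations are causally necessary for that transfer rather than incidentally correlated with it.
Key design: Models receive a varying number of demonstrations (0, 1, 3, 5, or 10) and execute the task without any weight updates. Primitive-trained models reach 78% of fine-tuned upper-bound performance with only 3 demonstrations, whereas flat-trained models need 10.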
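The causal-necessity claim rests, per the original abstract, on ablating the primitive-decodable subspace of hidden states and comparing against a random subspace of equal dimensionality. The sketch below shows one way such an ablation could be set up with NumPy. It is not the authors' procedure: the ridge-regression probe, the function names, and the use of the probe's column space as the "primitive-decodable subspace" are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_probe(hidden, labels, n_classes, ridge=1e-3):
    """Ridge-regress centered one-hot primitive labels on centered hidden states.

    Assumption: a linear probe defines the primitive-decodable subspace.
    Returns a weight matrix of shape (hidden_dim, n_classes).
    """
    X = hidden - hidden.mean(axis=0, keepdims=True)
    Y = np.eye(n_classes)[labels]
    Y = Y - Y.mean(axis=0, keepdims=True)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)

def orthonormal_basis(W, tol=1e-8):
    """Orthonormal basis for the column space of W, dropping null directions."""
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, s > tol * s.max()]

def ablate(hidden, basis):
    """Remove the component of each hidden state that lies in span(basis)."""
    return hidden - (hidden @ basis) @ basis.T

def random_basis(d, k, rng):
    """Random k-dimensional orthonormal basis in R^d (the control condition)."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q

# Synthetic stand-in for hidden states; not data from the paper.
rng = np.random.default_rng(0)
n, d, c = 400, 32, 4
labels = np.arange(n) % c
class_means = 3.0 * rng.standard_normal((c, d))
hidden = class_means[labels] + rng.standard_normal((n, d))

probe_basis = orthonormal_basis(fit_linear_probe(hidden, labels, c))
control_basis = random_basis(d, probe_basis.shape[1], rng)

hidden_probe_ablated = ablate(hidden, probe_basis)      # treatment
hidden_random_ablated = ablate(hidden, control_basis)   # control
```

In an actual experiment the two ablated variants would be fed back into the policy and few-shot success compared. Note that projecting out a single probe's directions may leave residual decodable information, so this is only a starting point.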
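📊 Experimental Highlights
Primitive-trained models reach 78% of fine-tuned upper-bound performance with only 3 demonstrations, whereas flat-trained models need 10 demonstrations, a roughly 3x gap in sample efficiency. The result replicates across training seeds, architectures, and datasets, which supports the robustness of the approach.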
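The "3x" figure is the ratio of demonstration counts needed to reach the same performance level. A minimal sketch of that arithmetic follows. Only the two operating points stated in the abstract are filled in; the helper name `min_demos_to_reach` and the assumption that "the same level" means exactly 0.78 for the flat-trained model are illustrative.

```python
def min_demos_to_reach(curve, target):
    """Smallest demonstration count m whose score meets the target, else None.

    `curve` maps m (number of demonstrations) to the fraction of the
    fine-tuned upper bound achieved.
    """
    reached = [m for m, score in sorted(curve.items()) if score >= target]
    return reached[0] if reached else None

# Only the points reported in the abstract; other m values are not given there.
primitive_curve = {3: 0.78}
flat_curve = {10: 0.78}  # assumption: "the same level" is taken as 0.78
target = 0.78

m_primitive = min_demos_to_reach(primitive_curve, target)
m_flat = min_demos_to_reach(flat_curve, target)
ratio = m_flat / m_primitive

print(f"{m_flat} / {m_primitive} = {ratio:.2f}x")  # prints "10 / 3 = 3.33x"
```

The exact ratio is 10/3, which the paper reports rounded as 3x.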
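🎯 Application Scenarios
Potential application areas include industrial automation, robotic manipulation, and smart manufacturing, where the approach could reduce the cost and time of teaching new tasks and make systems more flexible and adaptable. The method may also extend to a broader range of multimodal learning tasks in the future.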
📄 Abstract (original)
Deploying vision-language-action (VLA) policies in industrial environments requires the ability to teach new tasks at low cost, a property current VLAs lack, since each new task requires fine-tuning. We investigate whether primitive-aware training produces a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a small number of demonstrations, to perform tasks the policy was never trained on. We train two VLA architectures with different inductive biases, OpenVLA and $π_{0.5}$, on the REASSEMBLE contact-rich assembly dataset under matched LoRA fine-tuning recipes and locked hyperparameters, varying training between flat trajectories and primitive-segmented episodes with primitive-specific language prompts. We hold out 6 object-task combinations from training and evaluate few-shot transfer: models receive $m \in \{0, 1, 3, 5, 10\}$ demonstrations of a held-out task and attempt execution without weight updates. We replicate across three training seeds and validate on a second dataset (LIBERO-Long). Primitive-trained models reach 78% of fine-tuned upper-bound performance with only $m=3$ demonstrations, while flat-trained models require $m=10$ demonstrations to reach the same level, a $3\times$ sample efficiency gap that replicates across seeds, architectures, and datasets. To establish causation, we ablate the primitive-decodable subspace of hidden states and show few-shot transfer degrades by 32 percentage points while ablating a random subspace of equal dimensionality has no effect, indicating primitive representations are causally necessary rather than incidentally correlated with transfer. We identify and correct a methodological pitfall in evaluating chunked policies: family-wise inflation of single-step action-range gates produces order-of-magnitude higher false-failure rates against ground-truth human demonstrations.
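The family-wise inflation mentioned in the last sentence can be illustrated with standard multiple-testing arithmetic. If a gate is applied independently to each of K steps in an action chunk and each step has false-failure probability p, the chance that at least one step trips is 1 - (1 - p)^K. The sketch below assumes independent steps and uses hypothetical values for p and K. The Šidák-style per-step correction shown is one textbook remedy, not necessarily the correction the authors apply.

```python
def familywise_rate(per_step_rate, n_steps):
    """Probability that at least one of n_steps independent gates fires falsely."""
    return 1.0 - (1.0 - per_step_rate) ** n_steps

def sidak_per_step_rate(target_familywise_rate, n_steps):
    """Per-step rate that yields the target family-wise rate (Sidak correction)."""
    return 1.0 - (1.0 - target_familywise_rate) ** (1.0 / n_steps)

# Hypothetical values chosen for illustration; not taken from the paper.
per_step_rate = 0.01
chunk_length = 50

inflated = familywise_rate(per_step_rate, chunk_length)
corrected_step_rate = sidak_per_step_rate(per_step_rate, chunk_length)
restored = familywise_rate(corrected_step_rate, chunk_length)

print(f"single-step false-failure rate: {per_step_rate:.3f}")
print(f"chunk-level rate, uncorrected:  {inflated:.3f}")
print(f"chunk-level rate, corrected:    {restored:.3f}")
```

With these illustrative numbers, a 1% per-step rate grows to roughly 40% at the chunk level, which is the kind of order-of-magnitude inflation the abstract describes.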