Primitive Subspaces Mediate Few-Shot Transfer in VLAs

Authors: Anya Singh, Cabrel Happi, Jai Relan, Varun Nair, Vidyut Baradwaj

Categories: cs.RO

Published: 2026-05-29

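💡 One-Sentence Summary

Primitive-aware training yields a primitive subspace that mediates few-shot transfer to unseen tasks in vision-language-action (VLA) policies.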

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: vision-language-action, few-shot learning, transfer learning, primitive-aware training, industrial automation

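📋 Key Points

Current vision-language-action (VLA) policies must be fine-tuned for every new task, so they cannot be taught new tasks at low cost.
The paper uses primitive-aware training to build a transferable library of sub-skills that can be composed at inference time, conditioned on a few demonstrations, to perform new tasks.
Primitive-trained models reach 78% of the fine-tuned upper bound with 3 demonstrations, whereas flat-trained models need 10.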

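📝 Abstract (translated from the Chinese summary)

Deploying vision-language-action (VLA) policies in industrial settings requires teaching new tasks at low cost, yet current VLA methods need fine-tuning for every new task. The paper asks whether primitive-aware training produces a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a few demonstrations, to perform tasks the policy was never trained on. The authors train two different VLA architectures on the REASSEMBLE dataset and compare primitive-trained against flat-trained models on few-shot transfer. Primitive-trained models reach 78% of the fine-tuned upper bound with only 3 demonstrations, while flat-trained models need 10, a 3x gap in sample efficiency.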

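🔬 Method Details

Problem definition: Current VLA policies require fine-tuning for each new task, which makes teaching new tasks costly and inefficient.

Core idea: Use primitive-aware training to build a transferable library of sub-skills, so that at inference time the model can compose those skills from a few demonstrations and perform tasks it has never seen.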

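Technical framework: Two VLA architectures, OpenVLA and $π_{0.5}$, are trained on the REASSEMBLE dataset with matched LoRA fine-tuning recipes and locked hyperparameters. Training is varied between flat trajectories and primitive-segmented episodes with primitive-specific language prompts.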
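The contrast between flat trajectories and primitive-segmented episodes can be sketched as a small data-preparation step. This is a minimal illustration, not the authors' pipeline: the `Segment` and `Episode` types, the field names, the prompt template, and the primitive names `pick` and `insert` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int      # inclusive step index (hypothetical annotation format)
    end: int        # exclusive step index
    primitive: str  # hypothetical primitive label, e.g. "pick"


@dataclass
class Episode:
    prompt: str
    steps: list


def flat_episode(steps, task_prompt):
    """Flat training: one episode covering the whole trajectory, one task prompt."""
    return [Episode(prompt=task_prompt, steps=list(steps))]


def primitive_episodes(steps, segments, obj):
    """Primitive-aware training: one episode per segment, each with its own prompt."""
    return [
        Episode(prompt=f"{seg.primitive} the {obj}", steps=list(steps[seg.start:seg.end]))
        for seg in segments
    ]


# Toy trajectory of 10 steps with two annotated primitives (illustrative only).
steps = list(range(10))
segments = [Segment(0, 4, "pick"), Segment(4, 10, "insert")]

flat = flat_episode(steps, "assemble the peg")
segmented = primitive_episodes(steps, segments, "peg")
```

Both variants contain the same steps; only the episode boundaries and the language prompts differ, which is the factor the comparison isolates.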

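Key innovation: The central contribution is evidence that primitive-trained models transfer to held-out tasks from only a few demonstrations, together with an ablation showing that primitive representations are causally necessary for this transfer rather than incidentally correlated with it. Ablating the primitive-decodable subspace of the hidden states degrades few-shot transfer by 32 percentage points, while ablating a random subspace of equal dimensionality has no effect.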
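The subspace ablation described in the original abstract can be illustrated with a projection onto the orthogonal complement of a subspace. The sketch below assumes NumPy is available and that the primitive-decodable subspace is given as a set of direction vectors (for instance, the weight rows of a linear probe); how the authors actually obtain and ablate the subspace is not stated in the abstract, so `primitive_basis` here is random stand-in data.

```python
import numpy as np


def ablate_subspace(hidden, basis):
    """Remove the component of each hidden state that lies in span(basis).

    hidden: (n, d) array of hidden states.
    basis:  (k, d) array whose rows span the subspace to remove.
    """
    # Orthonormalise the directions so that q @ q.T is an orthogonal projector.
    q, _ = np.linalg.qr(basis.T)  # q has shape (d, k)
    return hidden - (hidden @ q) @ q.T


rng = np.random.default_rng(0)
d, k = 16, 3

# Stand-in for probe directions; in practice these would come from a trained probe.
primitive_basis = rng.standard_normal((k, d))
# Control: a random subspace with the same dimensionality.
random_basis = rng.standard_normal((k, d))

hidden = rng.standard_normal((5, d))
ablated = ablate_subspace(hidden, primitive_basis)
control = ablate_subspace(hidden, random_basis)
```

After ablation, the hidden states have no component along any primitive direction, while the control removes the same number of dimensions elsewhere. Comparing downstream transfer under the two conditions is what separates a causal role from a generic effect of deleting dimensions.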

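Key design: Models receive $m \in \{0, 1, 3, 5, 10\}$ demonstrations of a held-out task and attempt execution without any weight updates. Primitive-trained models reach 78% of fine-tuned performance with only 3 demonstrations, whereas flat-trained models need 10.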
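The sample-efficiency comparison reduces to finding the smallest $m$ at which each model reaches a given fraction of the fine-tuned upper bound. The sketch below shows that computation; the function name is hypothetical, and the success rates are placeholder values chosen for illustration, not results from the paper.

```python
def min_demos_to_reach(success_by_m, upper_bound, target_fraction):
    """Smallest m whose success rate is at least target_fraction of upper_bound.

    Returns None if no evaluated m reaches the target.
    """
    for m in sorted(success_by_m):
        if success_by_m[m] / upper_bound >= target_fraction:
            return m
    return None


# PLACEHOLDER numbers for illustration only; these are NOT results from the paper.
upper_bound = 0.80
primitive = {0: 0.20, 1: 0.45, 3: 0.63, 5: 0.66, 10: 0.70}
flat = {0: 0.10, 1: 0.20, 3: 0.35, 5: 0.50, 10: 0.63}

m_primitive = min_demos_to_reach(primitive, upper_bound, 0.78)
m_flat = min_demos_to_reach(flat, upper_bound, 0.78)
efficiency_gap = m_flat / m_primitive
```

With thresholds at $m=3$ and $m=10$, the ratio is $10/3 \approx 3.3$, which the paper reports as a $3\times$ gap.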

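📊 Experimental Highlights

Primitive-trained models reach 78% of the fine-tuned upper bound with only 3 demonstrations, while flat-trained models need 10, a 3x gap in sample efficiency. The gap replicates across three training seeds, both architectures, and a second dataset (LIBERO-Long), which supports the robustness of the finding.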

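🎯 Application Scenarios

Potential applications include industrial automation, robotic manipulation, and smart manufacturing, where the approach could reduce the cost and time of teaching new tasks and make systems more flexible and adaptable. The method may later extend to a broader range of multimodal learning tasks.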

📄 Abstract (original)

Deploying vision-language-action (VLA) policies in industrial environments requires the ability to teach new tasks at low cost, a property current VLAs lack, since each new task requires fine-tuning. We investigate whether primitive-aware training produces a transferable artifact: a learned library of sub-skills that can be composed at inference time, conditioned on a small number of demonstrations, to perform tasks the policy was never trained on. We train two VLA architectures with different inductive biases, OpenVLA and $π_{0.5}$, on the REASSEMBLE contact-rich assembly dataset under matched LoRA fine-tuning recipes and locked hyperparameters, varying training between flat trajectories and primitive-segmented episodes with primitive-specific language prompts. We hold out 6 object-task combinations from training and evaluate few-shot transfer: models receive $m \in \{0, 1, 3, 5, 10\}$ demonstrations of a held-out task and attempt execution without weight updates. We replicate across three training seeds and validate on a second dataset (LIBERO-Long). Primitive-trained models reach 78% of fine-tuned upper-bound performance with only $m=3$ demonstrations, while flat-trained models require $m=10$ demonstrations to reach the same level -- a $3\times$ sample efficiency gap that replicates across seeds, architectures, and datasets. To establish causation, we ablate the primitive-decodable subspace of hidden states and show few-shot transfer degrades by 32 percentage points while ablating a random subspace of equal dimensionality has no effect, indicating primitive representations are causally necessary rather than incidentally correlated with transfer. We identify and correct a methodological pitfall in evaluating chunked policies: family-wise inflation of single-step action-range gates produces order-of-magnitude higher false-failure rates against ground-truth human demonstrations.
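The family-wise inflation mentioned at the end of the abstract can be illustrated with a short calculation. The sketch assumes that a chunk is rejected when any one of its steps fails a per-step range check and that the per-step false failures are independent; both are simplifying assumptions, and the per-step rate and chunk length below are hypothetical. The Šidák adjustment shown is one standard correction, not necessarily the one the authors apply.

```python
def familywise_false_failure(alpha_step, horizon):
    """Probability that at least one of `horizon` independent per-step gates
    falsely fails, given a per-step false-failure rate `alpha_step`."""
    return 1.0 - (1.0 - alpha_step) ** horizon


def sidak_per_step_alpha(alpha_family, horizon):
    """Per-step rate that keeps the chunk-level false-failure rate at
    `alpha_family` under independence (Sidak correction)."""
    return 1.0 - (1.0 - alpha_family) ** (1.0 / horizon)


alpha_step = 0.01  # hypothetical per-step false-failure rate
horizon = 50       # hypothetical number of steps in an action chunk

# Uncorrected: the chunk-level rate is far above the per-step rate (about 0.39 here).
inflated = familywise_false_failure(alpha_step, horizon)

# Corrected: tighten each per-step gate so the chunk-level rate returns to 0.01.
corrected_step = sidak_per_step_alpha(alpha_step, horizon)
restored = familywise_false_failure(corrected_step, horizon)
```

Under these assumptions a gate tuned for a 1% single-step error rejects roughly 39% of valid 50-step chunks, which is the kind of order-of-magnitude inflation the abstract describes.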
