Key-Gram: Extensible World Knowledge for Embodied Manipulation

📄 arXiv: 2605.18556v1 📥 PDF

Authors: Jingjing Fan, Siyuan Li, Botao Ren, Zhidong Deng

Categories: cs.RO, cs.AI

Published: 2026-05-18

Comments: 16 pages, 5 figures


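💡 One-Sentence Takeaway

Proposes Key-Gram, a conditional-memory framework that decouples language-derived world knowledge from visual-state reasoning in embodied control.

🎯 Matched areas: Pillar 1: Robot Control; Pillar 9: Embodied Foundation Models

Keywords: embodied control, conditional memory, visual reasoning, linguistic knowledge, robot manipulation, dynamic environments, performance improvement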

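📋 Key Points

  1. Existing vision-language-action models couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, which causes modality competition and ties knowledge extension to backbone updates.
  2. Key-Gram uses a conditional-memory module to separate linguistic knowledge from visual reasoning, so instruction knowledge can be extended in an external memory rather than in the backbone.
  3. Key-Gram improves both the $π_{0}$ and $π_{0.5}$ backbones, with average relative gains of 29.5%/9.9% on RoboTwin2.0, 35.8%/4.5% on LIBERO-Plus transfer, and 15.4%/8.1% on real-world long-horizon tasks.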

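📝 Abstract (Summary)

Embodied control increasingly requires models to follow compositional language instructions, yet existing vision-language-action policies often couple linguistic knowledge with visual computation, so extending that knowledge depends on updating the backbone. This paper proposes Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design lets the backbone devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both the $π_{0}$ and $π_{0.5}$ backbones.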

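🔬 Method Details

Problem definition: The paper targets the coupling of linguistic knowledge and visual reasoning in existing vision-language-action models. This coupling makes knowledge hard to extend and limits performance.

Core idea: Key-Gram uses a conditional-memory module to separate language-derived world knowledge from visual-state reasoning, so that the backbone can concentrate on visual reasoning and action inference.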

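Technical framework: The architecture centers on a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers of the network through context-aware gating and lightweight convolutional fusion.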
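The decompose, look up, and inject steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: word n-grams stand in for the task-specific key-grams, SHA-256 stands in for the unspecified deterministic hash, the gate is a plain sigmoid of a dot product, and the convolutional fusion step is omitted. All names (`key_grams`, `slot`, `retrieve`, `inject`, `TABLE_SIZE`, `DIM`) are hypothetical.

```python
import hashlib
import math

TABLE_SIZE = 1024  # hypothetical number of memory rows
DIM = 4            # hypothetical embedding width

# Stand-in for the memory table; a trained model would hold learned rows.
memory = [[0.0] * DIM for _ in range(TABLE_SIZE)]


def key_grams(instruction, max_n=2):
    """Assumed decomposition: lowercase word n-grams for n = 1..max_n."""
    words = instruction.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def slot(gram, table_size=TABLE_SIZE):
    """Deterministic hash: the same gram always maps to the same row index."""
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % table_size


def retrieve(instruction):
    """Fetch one memory row per key-gram; each fetch is a single index."""
    return [memory[slot(g)] for g in key_grams(instruction)]


def inject(hidden, entry):
    """Assumed gate: sigmoid of the hidden/entry dot product scales the entry."""
    score = sum(h * e for h, e in zip(hidden, entry))
    gate = 1.0 / (1.0 + math.exp(-score))
    return [h + gate * e for h, e in zip(hidden, entry)]
```

Because the row index depends only on the text of the key-gram, retrieval needs no search over the table, and new rows can be added without touching the backbone weights.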

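Key innovation: Linguistic knowledge is externalized. Reusable instruction knowledge is stored in an extensible external memory, so extending it does not require updating the backbone.

Key design: Deterministic hashed lookup gives an $O(1)$ access pattern, so the logical memory table can be conveniently partitioned during training and placed efficiently on host memory during inference. The injection path is itself lightweight: context-aware gating followed by convolutional fusion.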
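One way to picture partitioning a logical table is shown below. The paper does not describe its partitioning scheme, so the round-robin layout and the names `locate` and `ShardedTable` are assumptions made for illustration only. The point is that a logical row maps to a (shard, local row) pair with constant-time arithmetic, so a lookup stays $O(1)$ wherever the shards are stored.

```python
def locate(logical_row, num_shards):
    """Map a logical row to (shard id, row within shard) in O(1)."""
    return logical_row % num_shards, logical_row // num_shards


class ShardedTable:
    """A logical table of num_rows rows spread over num_shards partitions."""

    def __init__(self, num_rows, dim, num_shards):
        self.num_shards = num_shards
        rows_per_shard = -(-num_rows // num_shards)  # ceiling division
        self.shards = [
            [[0.0] * dim for _ in range(rows_per_shard)]
            for _ in range(num_shards)
        ]

    def get(self, logical_row):
        shard, local = locate(logical_row, self.num_shards)
        return self.shards[shard][local]

    def set(self, logical_row, values):
        shard, local = locate(logical_row, self.num_shards)
        self.shards[shard][local] = list(values)
```

For example, with four shards `locate(10, 4)` returns `(2, 2)`: logical row 10 lives in shard 2 at local row 2.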

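📊 Experimental Highlights

Key-Gram consistently improves both the $π_{0}$ and $π_{0.5}$ backbones. The reported average relative gains are 29.5%/9.9% on RoboTwin2.0, 35.8%/4.5% on LIBERO-Plus transfer without target-domain fine-tuning, and 15.4%/8.1% on real-world long-horizon tasks. These results indicate that externalized linguistic memory is an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.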
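The abstract reports these figures as relative gains, which differ from absolute percentage-point differences. The sketch below shows the distinction. The function names are hypothetical, and the success rates in the usage note are made-up illustrative values, not results from the paper.

```python
def relative_gain(baseline, improved):
    """Relative gain in percent: the improvement as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


def absolute_gain(baseline, improved):
    """Absolute gain in percentage points."""
    return improved - baseline
```

With an illustrative baseline success rate of 50% and an improved rate of 60%, `relative_gain(50.0, 60.0)` is 20.0 while `absolute_gain(50.0, 60.0)` is 10.0, so the same result can be quoted as a 20% relative gain or a 10-point absolute gain.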

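🎯 Application Scenarios

Key-Gram has potential applications in embodied manipulation, robot control, and human-robot interaction. By improving how a model grounds and executes compositional instructions, it could support more autonomous robot operation in dynamic environments and better performance on practical tasks.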

📄 Abstract (Original)

Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its $O(1)$ lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both $π_{0}$ and $π_{0.5}$ backbones, with average relative gains of $29.5\%/9.9\%$ on RoboTwin2.0, $35.8\%/4.5\%$ on LIBERO-Plus transfer without target-domain fine-tuning, and $15.4\%/8.1\%$ on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.