MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

Authors: Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou, Tiancai Wang, Xiangyu Zhang, Ping Luo, Gao Huang

Categories: cs.RO, cs.CV

Published: 2026-06-08

Notes: The project is available at https://shihao1895.github.io/MemoryVLA-PP-Web

🔗 Code/Project: https://shihao1895.github.io/MemoryVLA-PP-Web

💡 One-Sentence Summary

Proposes MemoryVLA++ to address temporal modeling in robotic manipulation.

🎯 Matched Areas: Pillar 1: Robot Control; Pillar 2: RL & Architecture; Pillar 9: Embodied Foundation Models

Keywords: temporal modeling, robotic manipulation, vision-language-action, memory mechanism, imagination, multimodal learning, deep learning

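📋 Key Points

Existing VLA models rely mainly on the current observation, so they lack effective temporal modeling and struggle with manipulation tasks that have long-range temporal dependencies.
MemoryVLA++ adds memory and imagination mechanisms, combining working memory, a memory bank, and a world model to improve how the robot handles temporal information.
The method performs strongly across multiple benchmarks. On real robots, it achieves gains of +9%, +26%, and +28% on general, memory-dependent, and imagination-dependent tasks, respectively.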

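📝 Abstract (Translated Summary)

Temporal modeling is essential for robotic manipulation, because effective control requires both remembering past interactions and imagining future states. However, most vision-language-action (VLA) models rely primarily on the current observation and struggle with long-horizon, temporally dependent tasks. Inspired by cognitive science, this paper proposes MemoryVLA++, which equips VLA models with memory and imagination. A pretrained vision-language model (VLM) encodes the current observation to form working memory, which then queries a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. A world model imagines future states in a denoising latent space, and the resulting temporal-aware tokens are used to predict temporally consistent action sequences. Experiments on multiple simulation benchmarks and real-robot tasks show strong performance, validating the combination of memory and imagination.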

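🔬 Method Details

Problem definition: The paper targets the weakness of existing VLA models on long-horizon, temporally dependent tasks, in particular their lack of memory of past interactions and of imagination of future states.

Core idea: MemoryVLA++ builds a full temporal modeling framework that combines working memory, retrieved historical context, and a world model to strengthen the robot's understanding and handling of temporal information.

Technical framework: The framework has three main modules. A pretrained vision-language model (VLM) encodes the current observation; a Perceptual-Cognitive Memory Bank stores and retrieves historical context; and a world model imagines future states. The resulting tokens condition a diffusion action expert that predicts the action sequence.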
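As a rough illustration of the retrieval step, the sketch below ranks stored entries by similarity to the current working-memory token. The summary does not say how the Perceptual-Cognitive Memory Bank is queried, so the cosine-similarity top-k rule, the `retrieve` function, and the toy 3-dimensional tokens are all assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, bank, k=2):
    """Return the k bank entries most similar to the query, best first.

    Assumption: similarity-ranked top-k lookup; the paper may use a
    different mechanism (for example attention over all entries).
    """
    ranked = sorted(bank, key=lambda e: cosine(query, e["token"]), reverse=True)
    return ranked[:k]

# Hypothetical memory bank: each entry holds a timestep and a toy token.
bank = [
    {"step": 0, "token": [1.0, 0.0, 0.0]},
    {"step": 1, "token": [0.0, 1.0, 0.0]},
    {"step": 2, "token": [0.9, 0.1, 0.0]},
]
query = [1.0, 0.05, 0.0]  # stands in for the current working-memory token
hits = retrieve(query, bank, k=2)
print([h["step"] for h in hits])
```

In this toy setup the entries from steps 0 and 2 are returned, since both point in nearly the same direction as the query, while the unrelated entry from step 1 is skipped.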

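Key innovation: The main innovation is the combination of the Perceptual-Cognitive Memory Bank with a world model, which lets the model integrate information and make predictions along the temporal dimension and improves task execution.

Key design: A redundancy-aware consolidation mechanism updates the memory bank so that historical information is stored and used efficiently. Future states are imagined in a denoising latent space and integrated under memory guidance to form temporal-aware tokens. The loss functions and network architecture are said to be tuned for performance, but this summary gives no specifics.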
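One way to picture a redundancy-aware update is a fixed-capacity bank that, when full, merges its two most similar neighbouring entries instead of discarding the oldest one. The source only names the mechanism "redundancy-aware consolidation", so the merge rule (average the most similar adjacent pair), the `consolidate` function, and the `capacity` parameter below are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def consolidate(bank, new_token, capacity):
    """Append new_token; if over capacity, merge the most redundant pair.

    Assumption: redundancy is measured as cosine similarity between
    temporally adjacent entries, and merging means element-wise averaging.
    """
    bank = bank + [list(new_token)]
    if len(bank) <= capacity:
        return bank
    # Index of the adjacent pair with the highest similarity.
    i = max(range(len(bank) - 1), key=lambda j: cosine(bank[j], bank[j + 1]))
    merged = [(x + y) / 2 for x, y in zip(bank[i], bank[i + 1])]
    return bank[:i] + [merged] + bank[i + 2:]

# Toy bank with capacity 2: the new token nearly duplicates the last entry.
bank = [[1.0, 0.0], [0.0, 1.0]]
bank = consolidate(bank, [0.0, 0.8], capacity=2)
print(bank)
```

Here the new token is merged with the last stored entry, so the distinct first entry survives and the bank stays at its capacity of two.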

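📊 Experimental Highlights

MemoryVLA++ performs strongly on Libero, SimplerEnv, Mikasa-Robo, and other benchmarks. On real-robot tasks in particular, it achieves gains of +9%, +26%, and +28% on general, memory-dependent, and imagination-dependent tasks, respectively.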

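🎯 Application Scenarios

Potential application areas include robotic manipulation, automated production lines, and smart homes. By improving how robots handle temporal information, MemoryVLA++ could enable more efficient task execution in complex environments.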

📄 Abstract (Original)

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web
