RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics

Authors: Yuzhi Huang, Jie Wu, Weijue Bu, Ziyi Xiong, Gaoyang Jiang, Ye Li, Kangye Ji, Shuzhao Xie, Yue Huang, Chenglei Wu, Jingyan Jiang, Zhi Wang

Category: cs.RO

Published: 2026-03-13

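💡 One-Sentence Summary

RoboStream: a vision-language model framework that fuses spatio-temporal reasoning with memory to improve robotic manipulation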

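🎯 Matched Areas: Pillar 1: Robot Control; Pillar 9: Embodied Foundation Models

Keywords: robotic manipulation, vision-language models, spatio-temporal reasoning, causal reasoning, long-horizon planning, geometric anchoring, state transitions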

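📋 Key Points

Robot planners built on vision-language models lack spatio-temporal reasoning and cannot remember how actions have changed the environment, so perceptual errors accumulate over long-horizon tasks.
RoboStream achieves geometric anchoring through Spatio-Temporal Fusion Tokens (STF-Tokens) and records state transitions in a Causal Spatio-Temporal Graph (CSTG), emulating human causal spatio-temporal reasoning.
RoboStream outperforms existing methods on long-horizon RLBench (90.5%) and on real-world block-building tasks (44.4% versus 11.1% for both SoFar and VoxPoser), supporting the importance of spatio-temporal reasoning and causal memory for long-horizon manipulation.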

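📝 Abstract (Summary)

To enable reliable long-horizon robotic manipulation, the paper proposes RoboStream, a training-free framework. It achieves geometric anchoring through Spatio-Temporal Fusion Tokens (STF-Tokens), which bind visual evidence to 3D geometric attributes for persistent object grounding. It also uses a Causal Spatio-Temporal Graph (CSTG) to record action-triggered state transitions across steps, maintaining causal continuity. This design lets the planner trace causal chains and preserve object permanence under occlusion without additional training or fine-tuning. RoboStream achieves a 90.5% success rate on long-horizon RLBench and 44.4% on challenging real-world block-building tasks, where SoFar and VoxPoser both score 11.1%. The results indicate that spatio-temporal reasoning and causal memory are critical missing components for reliable long-horizon manipulation.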

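🔬 Method Details

Problem definition: Existing VLM-based robot planning methods treat each step as an isolated observation-to-action mapping. They must re-infer scene geometry at every decision point and are unaware of how earlier actions have affected the environment. In long-horizon tasks, perceptual errors therefore accumulate during execution and temporarily occluded objects are forgotten, which leads to precondition violations and cascading failures in later steps.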
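The error accumulation described above can be illustrated with a back-of-the-envelope model. This is an illustrative assumption, not an analysis from the paper: suppose each of $n$ steps succeeds independently with probability $p$ (both symbols and the numbers below are hypothetical).

```latex
P(\text{task succeeds}) = p^{n},
\qquad \text{e.g. } p = 0.95,\ n = 10 \;\Rightarrow\; 0.95^{10} \approx 0.60
```

Under this simplified assumption, even a 5% per-step error rate leaves only about a 60% chance of completing a ten-step task, which suggests why persistent state tracking matters more as the horizon grows.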

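Core idea: RoboStream emulates human causal spatio-temporal reasoning by maintaining a persistent mental model that continuously tracks spatial relations and action consequences, rather than reconstructing them at every instant. It addresses perceptual error accumulation in long-horizon manipulation through geometric anchoring and causal continuity.

Technical framework: RoboStream has two main components, Spatio-Temporal Fusion Tokens (STF-Tokens) and the Causal Spatio-Temporal Graph (CSTG). STF-Tokens bind visual evidence to 3D geometric attributes for persistent object grounding. The CSTG records action-triggered state transitions across steps to maintain causal continuity. The overall pipeline first extracts STF-Tokens from visual input, then reasons and plans over the CSTG, and finally outputs action commands.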
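The pipeline above can be sketched as a small data structure. This is only a minimal illustration under stated assumptions, not the paper's implementation: `AnchoredObject` is a hypothetical stand-in for an STF-Token (reduced here to a label plus a 3D position), `CausalMemory` is a hypothetical stand-in for the CSTG (reduced to an object table plus an ordered transition log), and all class, method, and object names are invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class AnchoredObject:
    """Hypothetical stand-in for an STF-Token: a label bound to 3D geometry."""
    name: str
    position: Vec3        # last known 3D position
    last_seen_step: int   # step at which the object was last observed
    visible: bool = True


@dataclass
class Transition:
    """One action-triggered state change (an edge in the hypothetical graph)."""
    step: int
    action: str
    obj: str
    before: Vec3
    after: Vec3


@dataclass
class CausalMemory:
    """Hypothetical stand-in for the CSTG: objects plus a transition log."""
    objects: Dict[str, AnchoredObject] = field(default_factory=dict)
    transitions: List[Transition] = field(default_factory=list)

    def observe(self, step: int, detections: Dict[str, Vec3]) -> None:
        """Refresh anchors from the current frame; keep objects not seen."""
        for name, pos in detections.items():
            self.objects[name] = AnchoredObject(name, pos, step, True)
        for name, obj in self.objects.items():
            if name not in detections:
                obj.visible = False  # occluded: retain last known position

    def record(self, step: int, action: str, obj: str, new_pos: Vec3) -> None:
        """Log an action's effect and move the object's anchor accordingly."""
        anchor = self.objects[obj]
        self.transitions.append(
            Transition(step, action, obj, anchor.position, new_pos)
        )
        anchor.position = new_pos

    def trace(self, obj: str) -> List[str]:
        """Return the actions that affected an object, in execution order."""
        return [t.action for t in self.transitions if t.obj == obj]


# Usage: stack one block on another, after which the lower block is occluded.
memory = CausalMemory()
memory.observe(0, {"red_block": (0.1, 0.0, 0.0), "blue_block": (0.3, 0.0, 0.0)})
memory.record(1, "stack red_block on blue_block", "red_block", (0.3, 0.0, 0.05))
memory.observe(2, {"red_block": (0.3, 0.0, 0.05)})  # blue_block not detected
```

In this sketch the occluded `blue_block` stays in memory with its last known position instead of being dropped, and `trace("red_block")` recovers the action that moved it. A planner could consult both before choosing the next step; how RoboStream actually does this is not described in this summary.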

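Key innovation: RoboStream introduces STF-Tokens and the CSTG to bring geometric anchoring and causal memory to vision-language models. Unlike existing methods, it maintains a persistent understanding of the environment and reasons over it in long-horizon tasks without additional training or fine-tuning.

Key design: The central question for STF-Tokens is how to fuse visual features with 3D geometric attributes effectively. The central question for the CSTG is how to record action-triggered state transitions accurately and use them for causal reasoning. The specific technical details (such as the STF-Token fusion method and the CSTG's graph structure and update strategy) are not given in this summary.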

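📊 Experimental Highlights

RoboStream achieves a 90.5% success rate on long-horizon RLBench tasks, reported as significantly better than existing methods. On the more challenging real-world block-building tasks, RoboStream reaches 44.4%, while SoFar and VoxPoser both reach 11.1%, indicating a clear advantage in long-horizon manipulation in complex environments.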

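🎯 Application Scenarios

RoboStream could apply to home service robots, industrial automation, and medical assistance robots. The approach may improve manipulation in complex, dynamic environments and help robots complete long-horizon tasks more reliably, which in turn could raise productivity and service quality. The technique may also extend to other areas of robotics that need more capable and dependable systems.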

📄 Abstract (Original)

Enabling reliable long-horizon robotic manipulation is a crucial step toward open-world embodied intelligence. However, VLM-based planners treat each step as an isolated observation-to-action mapping, forcing them to reinfer scene geometry from raw pixels at every decision point while remaining unaware of how prior actions have reshaped the environment. Despite strong short-horizon performance, these systems lack the spatio-temporal reasoning required for persistent geometric anchoring and memory of action-triggered state transitions. Without persistent state tracking, perceptual errors accumulate across the execution horizon, temporarily occluded objects are catastrophically forgotten, and these compounding failures lead to precondition violations that cascade through subsequent steps. In contrast, humans maintain a persistent mental model that continuously tracks spatial relations and action consequences across interactions rather than reconstructing them at each instant. Inspired by this human capacity for causal spatio-temporal reasoning with persistent memory, we propose RoboStream, a training-free framework that achieves geometric anchoring through Spatio-Temporal Fusion Tokens (STF-Tokens), which bind visual evidence to 3D geometric attributes for persistent object grounding, and maintains causal continuity via a Causal Spatio-Temporal Graph (CSTG) that records action-triggered state transitions across steps. This design enables the planner to trace causal chains and preserve object permanence under occlusion without additional training or fine-tuning. RoboStream achieves 90.5% on long-horizon RLBench and 44.4% on challenging real-world block-building tasks, where both SoFar and VoxPoser score 11.1%, demonstrating that spatio-temporal reasoning and causal memory are critical missing components for reliable long-horizon manipulation.
