Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

📄 arXiv: 2606.06476v1 📥 PDF

Authors: Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

Categories: cs.CV

Published: 2026-06-04

Notes: Project page: https://zcmax.github.io/projects/Thinking-With-Imagination


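💡 One-Sentence Takeaway

Proposes Astra, an agentic framework that strengthens the spatial reasoning of vision-language models by letting them query a world simulator for imagined views.

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 6: Video Extraction & Matching; Pillar 9: Embodied Foundation Models

Keywords: vision-language models, spatial reasoning, world simulator, reinforcement learning, imagination-based reasoning, multi-view observation, robot navigation, augmented reality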

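📋 Key Points

  1. Existing vision-language models are limited in spatial reasoning: they struggle to infer unobserved layouts and to keep their reasoning consistent across views.
  2. The paper proposes Astra, a framework in which the vision-language model interacts with a world simulator during reasoning and actively acquires imagined visual evidence.
  3. Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench, and Astra-VL improves the Qwen3-VL backbone on MMSI-Bench and MindCube, supporting the value of imagined observations.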

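📝 Abstract (translated summary)

Although Vision-Language Models (VLMs) show strong visual reasoning, their spatial reasoning remains largely confined to the observed images and text-oriented chain-of-thought. They struggle to infer unobserved layouts, to maintain cross-view consistency, and to reason from alternative viewpoints when only limited egocentric observations are available. The paper proposes Astra, an agentic spatial reasoning framework in which the model actively acquires imagined visual evidence by interacting with a world simulator. Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. Experiments indicate that both the world simulator and the agentic policy are necessary, and that each improves spatial reasoning performance.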

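🔬 Method Details

Problem definition: The paper targets the weaknesses of vision-language models in spatial reasoning, in particular inferring unobserved layouts and keeping views consistent. Existing approaches reason only over the images they are given and cannot imagine additional viewpoints.

Core idea: Astra introduces a world simulator so that the VLM can actively request imagined visual evidence while it reasons. The policy learns through interaction with the simulator, with the aim of improving answer accuracy.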

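Technical framework: Astra has two main components: Astra-VL, an RL-trained VLM policy, and Astra-WM, a world simulator. Astra-WM generates novel-view observations from context images and natural-language camera motions. Training covers both parts: Astra-WM is tuned for view consistency, and Astra-VL is trained with reinforcement learning with the world simulator in the loop.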
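The interaction between the policy and the simulator can be pictured as a simple loop. The sketch below is a minimal illustration only: the names `Step`, `reason_with_imagination`, `toy_policy`, and `toy_simulator` are hypothetical, images are represented as placeholder strings, and nothing here reflects the actual Astra interfaces, which this summary does not describe.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One policy decision (hypothetical structure)."""
    kind: str      # "imagine" to call the simulator, "answer" to stop
    content: str   # camera-motion text, or the final answer


# Assumed signatures: policy(question, context, must_answer) -> Step,
# simulator(context, camera_motion) -> new view.
Policy = Callable[[str, List[str], bool], Step]
Simulator = Callable[[List[str], str], str]


def reason_with_imagination(
    question: str,
    images: List[str],
    policy: Policy,
    simulator: Simulator,
    max_calls: int = 3,
) -> Tuple[str, List[str]]:
    """Alternate between policy decisions and simulator calls.

    Returns the final answer and the list of camera motions requested.
    """
    context = list(images)   # observed views plus imagined ones
    motions: List[str] = []
    for _ in range(max_calls):
        step = policy(question, context, False)
        if step.kind == "answer":
            return step.content, motions
        # The policy asked for a new viewpoint; add the imagined view.
        context.append(simulator(context, step.content))
        motions.append(step.content)
    # Simulator budget exhausted: require a direct answer.
    return policy(question, context, True).content, motions


def toy_policy(question: str, context: List[str], must_answer: bool) -> Step:
    """Stand-in policy: imagine once, then answer."""
    if must_answer or len(context) >= 2:
        return Step("answer", f"answer based on {len(context)} views")
    return Step("imagine", "turn left 90 degrees")


def toy_simulator(context: List[str], camera_motion: str) -> str:
    """Stand-in simulator: returns a label instead of an image."""
    return f"imagined view after '{camera_motion}'"


if __name__ == "__main__":
    print(reason_with_imagination(
        "What is behind the sofa?", ["img0"], toy_policy, toy_simulator))
```

The `max_calls` budget and the forced final answer are design choices of this sketch, added so the loop always terminates.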

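Key innovation: Astra combines an agentic policy with a world simulator, so that imagined visual evidence feeds into spatial reasoning. Unlike approaches that reason over a fixed set of observations, the model can generate new views on demand while it reasons.

Key design: Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, a world-simulator-in-the-loop two-phase curriculum stabilizes tool-use exploration and teaches the model to invoke the simulator only when imagined observations improve over direct answering.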
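One way to read "invoke the simulator only when imagined observations improve over direct answering" is as a reward that charges a small price for each simulator call. The function below is a hypothetical illustration of that reading, not the reward used in the paper; the name `tool_use_reward` and the `call_cost` value are assumptions.

```python
def tool_use_reward(
    direct_correct: bool,
    imagined_correct: bool,
    used_simulator: bool,
    call_cost: float = 0.1,
) -> float:
    """Hypothetical reward for a single question.

    direct_correct:   would the answer without imagination be correct?
    imagined_correct: would the answer with imagined views be correct?
    used_simulator:   did the policy choose to call the simulator?
    call_cost:        assumed penalty for calling the simulator.
    """
    if not used_simulator:
        return 1.0 if direct_correct else 0.0
    # Using the simulator pays off only if it leads to a correct answer,
    # and even then it scores slightly below an equally correct direct answer.
    return (1.0 if imagined_correct else 0.0) - call_cost


if __name__ == "__main__":
    for direct in (False, True):
        for imagined in (False, True):
            skip = tool_use_reward(direct, imagined, used_simulator=False)
            call = tool_use_reward(direct, imagined, used_simulator=True)
            print(f"direct={direct} imagined={imagined} "
                  f"skip={skip:.1f} call={call:.1f}")
```

Under this toy reward, calling the simulator is the better choice only when the direct answer would be wrong and the imagined views lead to a correct one.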

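📊 Experimental Highlights

Astra-WM raises simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5. Astra-VL raises the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results indicate that imagined observations can supply useful spatial evidence.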
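The reported scores correspond to absolute gains of 4.4, 9.0, and 5.9 points. The short script below recomputes these gains, along with relative gains, from the numbers in the abstract; the `RESULTS` table layout and the `gains` helper are illustrative.

```python
# Scores copied from the abstract; labels and layout are illustrative.
RESULTS = [
    ("Gemini-3-Flash + simulator (Astra-WM)", "MMSI-Bench", 45.1, 49.5),
    ("Qwen3-VL backbone (Astra-VL)", "MMSI-Bench", 29.8, 38.8),
    ("Qwen3-VL backbone (Astra-VL)", "MindCube", 36.8, 42.7),
]


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative_pct = round(100.0 * (after - before) / before, 1)
    return absolute, relative_pct


if __name__ == "__main__":
    for setting, benchmark, before, after in RESULTS:
        absolute, relative_pct = gains(before, after)
        print(f"{setting} on {benchmark}: {before} -> {after} "
              f"(+{absolute} points, +{relative_pct}% relative)")
```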

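🎯 Application Scenarios

Potential applications include robot navigation, augmented reality, and virtual reality, where an agent must reason about space in complex environments from limited observations. Stronger spatial reasoning could help such systems understand and interact with their surroundings, improving both user experience and task execution.

📄 Abstract (original)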

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.