SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

📄 arXiv: 2606.08992v1

Authors: Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

Categories: cs.RO, cs.AI, cs.CV

Published: 2026-06-08

Comments: 23 pages, 9 figures, 7 tables


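💡 One-sentence takeaway

Proposes SpaceVLN, an agent for zero-shot vision-and-language navigation.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: vision-and-language navigation, spatial cognitive memory, task-guided reasoning, zero-shot learning, robot navigation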

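📋 Key points

  1. Existing vision-and-language navigation methods rely on local visual cues and linear history-based reasoning, so they handle the spatial structure of unseen environments poorly.
  2. SpaceVLN introduces Spatial Cognitive Memory and Task-Guided Spatial Reasoning inside an efficient stagewise closed-loop framework to strengthen navigation.
  3. SpaceVLN achieves state-of-the-art zero-shot performance on several benchmarks, and real-robot deployment further validates its applicability.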

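📝 Abstract (translated from Chinese)

Vision-and-language navigation in continuous environments requires an agent to understand the spatial structure of unseen environments in order to follow language instructions. Although foundation models offer a promising path to zero-shot navigation, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of explored regions, traversed paths, landmarks, and their spatial relations. This paper proposes SpaceVLN, a navigation agent built on Spatial Cognitive Memory and Task-Guided Spatial Reasoning. SpaceVLN introduces an efficient stagewise closed-loop framework that organizes planning and execution around verifiable space-landmark stages. By progressively abstracting explored regions into Spatial Waypoints and dynamically maintaining subtask-grounded landmark evidence, it forms a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction to achieve task-guided spatial reasoning. SpaceVLN achieves state-of-the-art zero-shot performance on R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, and real-robot deployment further validates its applicability.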

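🔬 Method in detail

Problem definition: The paper targets agents' weak understanding of the spatial structure of unseen environments in vision-and-language navigation. Existing methods tend to rely on local visual cues and linear reasoning, which makes complex spatial relations hard to handle.

Core idea: SpaceVLN combines Spatial Cognitive Memory with Task-Guided Spatial Reasoning to build a navigation agent that maintains spatial information dynamically. The goal is better navigation ability and more efficient task execution in complex environments.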

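Technical framework: SpaceVLN uses a stagewise closed-loop framework in which planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and maintains landmark evidence tied to each subtask, forming a hierarchical Spatial Cognitive Memory.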
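The stagewise closed loop described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Stage` fields, the `act` and `verify` callbacks, and the `max_steps_per_stage` budget are all hypothetical names chosen here, and the only idea carried over from the summary is that the agent moves on to the next stage only after the current one has been verified.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One space-landmark stage (hypothetical structure, assumed for illustration)."""
    space: str     # region the agent should reach, e.g. "hallway"
    landmark: str  # landmark expected there, e.g. "door"


def run_stages(
    stages: List[Stage],
    act: Callable[[Stage], None],
    verify: Callable[[Stage], bool],
    max_steps_per_stage: int = 5,
) -> int:
    """Run stages in order, advancing only once the current stage is verified.

    Returns the index of the first stage that could not be verified within
    the step budget, or len(stages) if every stage was completed.
    """
    for i, stage in enumerate(stages):
        steps = 0
        # Closed loop: check, act, check again, until verified or out of budget.
        while not verify(stage):
            if steps == max_steps_per_stage:
                return i
            act(stage)
            steps += 1
    return len(stages)


if __name__ == "__main__":
    # Toy environment: a stage counts as verified after two actions in its space.
    progress = {"hallway": 0, "kitchen": 0}
    plan = [Stage("hallway", "door"), Stage("kitchen", "fridge")]

    def toy_act(stage: Stage) -> None:
        progress[stage.space] += 1

    def toy_verify(stage: Stage) -> bool:
        return progress[stage.space] >= 2

    print(run_stages(plan, toy_act, toy_verify))
```

Returning the index of the unfinished stage, rather than a plain boolean, is a design choice made for this sketch so that a caller could replan from the point of failure.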

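Key innovation: SpaceVLN introduces Spatial Cognitive Memory and Task-Guided Spatial Reasoning, which let the agent handle both Vision-and-Language Navigation and Object-Goal Navigation in a zero-shot setting. This differs fundamentally from the linear reasoning used by existing methods.

Key design: SpaceVLN uses a hierarchical Spatial Waypoint representation and combines task-progress reasoning with spatial perception, so that the agent can navigate efficiently in complex environments.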
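To make the idea of a hierarchical waypoint memory concrete, here is a small sketch under stated assumptions. The summary does not say how regions are abstracted into waypoints, so the distance-based merging, the `merge_radius` threshold, the 2D coordinates, and every class and method name below are hypothetical stand-ins rather than the authors' design.

```python
from dataclasses import dataclass, field
from math import hypot
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Waypoint:
    """Coarse node standing in for a cluster of visited positions (assumed 2D)."""
    x: float
    y: float
    visits: int = 1
    # Lower level: landmark labels grouped by the subtask they support.
    landmarks: Dict[str, List[str]] = field(default_factory=dict)


class SpatialMemory:
    """Hypothetical two-level memory: waypoint graph on top, landmark evidence below."""

    def __init__(self, merge_radius: float = 1.0) -> None:
        self.merge_radius = merge_radius          # assumed merge threshold
        self.waypoints: List[Waypoint] = []
        self.edges: Set[Tuple[int, int]] = set()  # traversed links between waypoints
        self._last: Optional[int] = None

    def _nearest(self, x: float, y: float) -> Optional[int]:
        """Index of the closest waypoint within merge_radius, or None."""
        best, best_dist = None, self.merge_radius
        for i, wp in enumerate(self.waypoints):
            dist = hypot(wp.x - x, wp.y - y)
            if dist <= best_dist:
                best, best_dist = i, dist
        return best

    def observe(self, x: float, y: float) -> int:
        """Fold a visited position into a nearby waypoint, or create a new one."""
        idx = self._nearest(x, y)
        if idx is None:
            self.waypoints.append(Waypoint(x, y))
            idx = len(self.waypoints) - 1
        else:
            wp = self.waypoints[idx]
            # A running mean keeps the waypoint at the centre of its visits.
            wp.x = (wp.x * wp.visits + x) / (wp.visits + 1)
            wp.y = (wp.y * wp.visits + y) / (wp.visits + 1)
            wp.visits += 1
        if self._last is not None and self._last != idx:
            self.edges.add((min(self._last, idx), max(self._last, idx)))
        self._last = idx
        return idx

    def add_landmark(self, idx: int, subtask: str, label: str) -> None:
        """Attach landmark evidence for a subtask to a waypoint."""
        self.waypoints[idx].landmarks.setdefault(subtask, []).append(label)
```

A caller would invoke `observe` at each step and `add_landmark` whenever a detector reports something relevant to the current subtask. The resulting graph is far smaller than the raw trajectory, which is the property the summary attributes to abstracting explored regions into waypoints.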

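📊 Experimental highlights

SpaceVLN achieves state-of-the-art zero-shot performance on R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON. The size of the success-rate gain over baseline methods is left as an unfilled placeholder ("XX%") in this summary. Real-robot deployment further validates its applicability.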

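🎯 Application scenarios

Potential application areas include intelligent robots, autonomous driving, and virtual reality, where the approach could improve autonomous navigation in complex environments. In the future, SpaceVLN may be applied to a broader range of tasks and advance human-robot interaction and intelligent systems.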

📄 Abstract (original)

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space--landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.