SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

📄 arXiv: 2605.10376v1 📥 PDF

Authors: Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

Categories: cs.CV

Published: 2026-05-11


💡 One-Sentence Summary

Introduces SleepWalk, a benchmark for stress-testing instruction-guided vision-language navigation and embodied reasoning.

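🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: embodied AI, vision-language navigation, spatial reasoning, multimodal benchmark, trajectory prediction, 3D scene understanding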

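📋 Key Points

  1. Existing VLMs still cannot reliably reason and navigate when they must turn natural-language instructions into geometrically consistent, executable action sequences in 3D space.
  2. The paper proposes the SleepWalk benchmark, which stress-tests embodied navigation in localized, interaction-centric scenes through three task tiers of increasing spatial and temporal complexity.
  3. Experiments show that current frontier VLMs fail systematically under occlusion, complex interaction constraints, and multi-step instructions, and that performance drops markedly as task difficulty rises.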

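📝 Abstract (translated from Chinese)

Vision-language models (VLMs) have advanced rapidly in multimodal perception and language understanding, but it remains unclear whether they can reliably turn language into spatially coherent, executable actions in 3D digital environments. To address this, the paper introduces SleepWalk, a benchmark for evaluating instruction-guided trajectory prediction in single-scene 3D worlds generated from textual scene descriptions. Unlike earlier navigation benchmarks that focus on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and ends at an interaction-compatible location. The benchmark covers a range of indoor and outdoor environments and groups tasks into three tiers of spatial and temporal complexity, enabling fine-grained analysis of compositional reasoning. An evaluation on 2,472 3D environments reveals systematic weaknesses of current frontier VLMs under occlusion, interaction constraints, and multi-step instructions, and gives embodied AI and multimodal reasoning research a key evaluation tool.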

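🔬 Method Details

Problem definition: The paper addresses the weak spatial grounding of VLMs in embodied navigation. Existing approaches mostly focus on long-range path planning and overlook how, within a local scene, a model maps an instruction precisely to an action sequence that respects physical geometry, avoids obstacles, and satisfies specific interaction requirements.

Core idea: Build a controlled, scalable 3D benchmark and split the navigation task into three difficulty tiers. By systematically increasing spatial and temporal complexity, the design exposes reasoning bottlenecks in handling occlusion, multi-step logic, and interaction constraints, and thereby quantifies how well a model actually understands an embodied environment.

Technical framework: SleepWalk is built on single-scene 3D worlds generated from text and contains 2,472 environments, each paired with nine instructions. The model receives rendered visual observations and a natural-language instruction and outputs a predicted trajectory. Evaluation uses a pointwise judge-based protocol, in which each prediction is scored on its own rather than against other predictions, to assess spatial coherence, executability, and task completion.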
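As a rough illustration of how pointwise judge scores could be rolled up per difficulty tier, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `JudgedSample` record, the three criterion names, the 0-to-1 score range, and the placeholder values are hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable

# Hypothetical criterion names, loosely mirroring the three properties named above.
CRITERIA = ("spatial_coherence", "executability", "task_completion")


@dataclass(frozen=True)
class JudgedSample:
    """One judged prediction (assumed layout, not the benchmark's real schema)."""
    scene_id: str
    tier: int                 # difficulty tier: 1, 2, or 3
    scores: Dict[str, float]  # one score per criterion, assumed to lie in [0, 1]


def mean_score_by_tier(samples: Iterable[JudgedSample]) -> Dict[int, Dict[str, float]]:
    """Average each criterion separately within each tier."""
    buckets = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        for name in CRITERIA:
            buckets[sample.tier][name].append(sample.scores[name])
    return {
        tier: {name: mean(values) for name, values in by_criterion.items()}
        for tier, by_criterion in sorted(buckets.items())
    }


# Placeholder scores only; these are NOT results from the paper.
samples = [
    JudgedSample("scene-a", 1, dict.fromkeys(CRITERIA, 1.0)),
    JudgedSample("scene-b", 1, dict.fromkeys(CRITERIA, 0.0)),
    JudgedSample("scene-c", 3, {"spatial_coherence": 1.0,
                                "executability": 0.0,
                                "task_completion": 0.0}),
]
report = mean_score_by_tier(samples)
```

Because each sample is scored independently, the per-tier averages can be compared directly to see whether scores fall as the tier number rises.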

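Key innovation: An interaction-centric evaluation paradigm for navigation, rather than one based only on path length or success rate. The tiered difficulty design supports fine-grained analysis of how models fail at different levels of compositional complexity, filling a gap in the evaluation of localized embodied reasoning.

Key design: A standardized judge protocol quantifies a trajectory's geometric plausibility (for example, collision avoidance) and its semantic alignment with the instruction. The benchmark spans diverse indoor and outdoor scenes, so robustness and generalization can be analyzed across different environment distributions.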
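To make "geometric plausibility" concrete, the sketch below checks a 2D waypoint trajectory against axis-aligned obstacle footprints and tests whether it ends close to a target. This is only an illustrative stand-in: the summary describes a judge-based protocol, and nothing here is claimed to be how SleepWalk scores trajectories. The `Box` representation, the `reach` radius, and the function names are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned obstacle footprint on the floor plane (assumed representation)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def segment_hits_box(p: Point, q: Point, box: Box) -> bool:
    """Slab test: True if the segment from p to q touches or crosses the box."""
    t0, t1 = 0.0, 1.0
    axes = ((p[0], q[0], box.xmin, box.xmax), (p[1], q[1], box.ymin, box.ymax))
    for start, end, lo, hi in axes:
        d = end - start
        if d == 0.0:
            # Segment is parallel to this slab: it must already lie inside it.
            if start < lo or start > hi:
                return False
            continue
        ta, tb = (lo - start) / d, (hi - start) / d
        if ta > tb:
            ta, tb = tb, ta
        t0, t1 = max(t0, ta), min(t1, tb)
        if t0 > t1:
            return False
    return True


def check_trajectory(waypoints: Sequence[Point], obstacles: List[Box],
                     target: Point, reach: float = 0.5) -> Dict[str, bool]:
    """Check a non-empty waypoint list for collisions and for ending near the target."""
    collision_free = not any(
        segment_hits_box(a, b, box)
        for a, b in zip(waypoints, waypoints[1:])
        for box in obstacles
    )
    end = waypoints[-1]
    distance = math.hypot(end[0] - target[0], end[1] - target[1])
    return {"collision_free": collision_free, "ends_near_target": distance <= reach}


# Toy scene: one obstacle between the start and the target.
obstacles = [Box(1.0, -1.0, 2.0, 1.0)]
straight = check_trajectory([(0.0, 0.0), (3.0, 0.0)], obstacles, target=(3.0, 0.2))
detour = check_trajectory([(0.0, 0.0), (0.0, 2.0), (3.0, 2.0), (3.0, 0.0)],
                          obstacles, target=(3.0, 0.2))
```

In this toy scene the straight path cuts through the obstacle while the detour goes around it, and both end within the assumed reach radius of the target. A real evaluator would also need agent width, 3D geometry, and interaction-specific end poses, which this sketch ignores.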

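📊 Experimental Highlights

SleepWalk stress-tests three frontier VLMs on 2,472 curated 3D environments. Performance drops markedly as the task tier becomes more complex, with clear systematic failures in occluded scenes and under multi-step instructions. The study quantifies the models' limits in spatial coherence and action executability and gives the embodied navigation field a key benchmark and directions for improvement.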

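🎯 Applications

The work applies mainly to the development and evaluation of embodied agents, particularly household service robots, driver-assistance systems, and virtual reality interaction. With SleepWalk, developers can pinpoint a model's weaknesses in complex spatial reasoning more precisely, which supports the development of next-generation agents that are robust, understand complex instructions, and interact safely with the physical environment.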

📄 Abstract (original)

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increase. In general, current VLMs can somewhat produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.