A 3D Isovist World Model -- Revealing a City's Unseen Geometry and Its Emergent Cross-City Signature
Authors: Xuhui Lin, Stephen Law, Nanjiang Chen, Kunyao Li, Tao Yang
Categories: cs.RO, cs.LG
Published: 2026-06-02
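💡 One-sentence takeaway
Proposes a 3D isovist world model that predicts the navigable geometry around an agent moving through a city, rather than the scene's appearance.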
🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 3: Spatial Perception & Semantics; Pillar 9: Embodied Foundation Models
Keywords: 3D isovist, urban navigation, geometric modeling, embodied agents, spatial reasoning, robotics, urban analysis
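📋 Key points
- Most existing world models predict scene appearance and neglect the geometry of the space an agent can actually move through, which is what matters for navigation.
- The paper encodes the open volume between buildings as a 3D isovist and predicts it as the agent moves, capturing navigable geometry without photometric entanglement and without collapsing the third dimension.
- A single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature: city identity is linearly decodable from its temporal latents far above single-frame baselines.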
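📝 Abstract (translated summary)
Embodied agents that navigate cities rely on world models that predict how their surroundings change as they move. Most existing models predict appearance rather than navigable space, and those that do target geometry, such as bird's-eye-view occupancy grids, flatten the environment onto a ground plane. This paper proposes a 3D isovist representation that records the open volume between buildings, capturing navigable geometry without photometric entanglement or dimensional collapse. The authors build a model that predicts the next isovist from a short history of past isovists and a movement action, and find that it develops a consistent spatial signature across cities. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for embodied AI, robotics, and urban analysis.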
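🔬 Method details
Problem definition: The paper addresses the lack of a predictive target that captures the navigable 3D geometry an agent actually traverses. Existing geometric targets such as bird's-eye-view occupancy grids flatten the three-dimensional environment onto a ground plane and discard the above-ground and multi-level structure that shapes real navigation.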
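Core idea: The open volume between buildings (the negative space) is encoded as a 3D isovist, a spherical visibility-depth map that records the distance to the nearest surface in every direction. The world model predicts the next isovist from a short history of past isovists and a movement action. Because the target is pure geometry, it avoids photometric entanglement.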
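To make the isovist definition concrete, here is a minimal sketch that ray-casts a spherical depth map from one viewpoint. The scene representation (axis-aligned boxes for buildings, a ground plane at z = 0), the grid sampling of directions, and every function and parameter name are assumptions made for illustration; the summary does not describe how the authors compute their isovists.

```python
import math

def ray_aabb(origin, direction, box_min, box_max):
    """Slab test: distance along the ray to an axis-aligned box, or None if missed."""
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this slab: it can only hit if it starts inside.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near

def isovist_3d(origin, boxes, n_elev=9, n_azim=16, max_range=100.0):
    """Return an n_elev x n_azim grid of nearest-surface distances (hypothetical layout)."""
    depth = []
    for i in range(n_elev):
        elev = math.pi * ((i + 0.5) / n_elev - 0.5)  # from near -90 to near +90 degrees
        row = []
        for j in range(n_azim):
            azim = 2.0 * math.pi * j / n_azim
            d = (math.cos(elev) * math.cos(azim),
                 math.cos(elev) * math.sin(azim),
                 math.sin(elev))
            nearest = max_range  # open sky or open street beyond the range cap
            if d[2] < -1e-12:    # downward rays can hit the ground plane z = 0
                nearest = min(nearest, -origin[2] / d[2])
            for box_min, box_max in boxes:
                t = ray_aabb(origin, d, box_min, box_max)
                if t is not None:
                    nearest = min(nearest, t)
            row.append(nearest)
        depth.append(row)
    return depth

# One building face 10 m ahead of an observer whose eye height is 1.5 m.
building = ((10.0, -50.0, 0.0), (12.0, 50.0, 30.0))
depth_map = isovist_3d((0.0, 0.0, 1.5), [building])
```

Upward rays here keep the full range cap while downward rays stop at the ground, which is the kind of vertical structure a ground-plane occupancy grid cannot express.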
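Technical framework: The model takes a short history of past isovists and a movement action as input, predicts the next isovist as a depth residual, and maintains a persistent latent bird's-eye-view spatial map for cross-path consistency. It is trained with self-rollout scheduled sampling so that it stays on the geometry manifold when its context is corrupted.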
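The depth-residual formulation can be illustrated with a small sketch: the next depth map is the most recent one plus a predicted correction, so wherever the correction is zero the previous frame's edges carry over unchanged. The `toy_residual` function is a hypothetical stand-in for the learned network, and the flattened three-ray depth map is purely illustrative.

```python
def predict_next_isovist(history, action, residual_fn, max_range=100.0):
    """Next depth map = latest depth map + predicted residual, clamped to a valid range."""
    last = history[-1]
    residual = residual_fn(history, action)
    return [min(max(d + r, 0.0), max_range) for d, r in zip(last, residual)]

def rollout(history, actions, residual_fn, context=3):
    """Autoregressive rollout: each prediction is appended and reused as context."""
    frames = [list(f) for f in history]
    preds = []
    for action in actions:
        nxt = predict_next_isovist(frames[-context:], action, residual_fn)
        preds.append(nxt)
        frames.append(nxt)
    return preds

def toy_residual(history, action):
    """Hypothetical stand-in for a learned model.

    Ray 0 points forward and ray 1 backward; moving forward by `action` metres
    shortens the forward depth and lengthens the backward depth.
    """
    residual = [0.0] * len(history[-1])
    residual[0] = -action
    residual[1] = action
    return residual

# Flattened depth map with three rays: forward, backward, sideways.
preds = rollout([[10.0, 5.0, 7.0]], actions=[1.0, 2.0], residual_fn=toy_residual)
```

In this toy run the sideways ray keeps its depth of 7.0 across both steps because its residual is zero.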
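Key innovation: The main contribution is the 3D isovist as a predictive target, which captures the agent's navigable space without discarding three-dimensional information. This differs from ground-plane approaches, which lose above-ground and multi-level structure.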
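Key design: The prediction is formulated as a depth residual so that the decoder inherits sharp building edges. Training uses self-rollout scheduled sampling, which keeps corrupted context on the geometry manifold.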
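As a rough illustration of scheduled sampling with self-rollout, the sketch below mixes the model's own predictions into the context window with probability `p_self` and pairs each prediction with its ground-truth target for the loss. The linear ramp schedule, the function names, and the scalar "frames" are assumptions; the summary does not give the authors' actual schedule or training loop.

```python
import random

def self_rollout_prob(step, total_steps, p_max=0.5):
    """Hypothetical linear ramp: rely more on self-generated context as training proceeds."""
    return p_max * min(1.0, step / total_steps)

def scheduled_rollout(gt_frames, predict_fn, p_self, rng, context=2):
    """Walk a sequence, feeding back predictions with probability p_self.

    Returns the context frames actually used and (prediction, target) pairs
    on which a training loss would be computed.
    """
    frames = list(gt_frames[:context])  # warm up on ground truth
    pairs = []
    for t in range(context, len(gt_frames)):
        pred = predict_fn(frames[-context:])
        pairs.append((pred, gt_frames[t]))
        # With probability p_self, later steps see the model's own output.
        frames.append(pred if rng.random() < p_self else gt_frames[t])
    return frames, pairs

# Scalar stand-ins for isovist frames; the toy predictor adds 1 to the latest frame.
gt = [0, 1, 2, 5, 9]
step_fn = lambda ctx: ctx[-1] + 1
teacher_forced, tf_pairs = scheduled_rollout(gt, step_fn, 0.0, random.Random(0))
self_fed, sf_pairs = scheduled_rollout(gt, step_fn, 1.0, random.Random(0))
```

With `p_self = 0.0` this reduces to ordinary teacher forcing, and with `p_self = 1.0` it becomes a pure rollout whose errors compound, which is the regime the training procedure is meant to prepare the model for.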
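📊 Experimental highlights
A single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature. City identity can be linearly decoded from the model's temporal latents far above single-frame baselines, which indicates that the signature lives in the learned dynamics rather than in appearance.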
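"Linearly decodable" means that a linear classifier fitted on frozen latent vectors can recover the city label. The sketch below shows the general shape of such a probe using a simple perceptron. The latents are synthetic stand-ins generated on the spot, so the accuracy it reports says nothing about the paper's results, and the choice of a perceptron is an assumption rather than the authors' probing method.

```python
import random

def make_latents(n, dim, rng):
    """Synthetic stand-in latents: two 'cities' offset along the first axis."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if label else -2.0
        data.append((x, label))
    return data

def train_linear_probe(data, epochs=50):
    """Perceptron probe: learns weights w and bias b on fixed (frozen) latents."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in data:
            y = 1 if label else -1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break  # every training latent is on the correct side
    return w, b

def probe_accuracy(w, b, data):
    """Fraction of latents whose predicted label matches the true label."""
    hits = 0
    for x, label in data:
        predicted = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        hits += predicted == bool(label)
    return hits / len(data)

train = make_latents(200, 4, random.Random(0))
held_out = make_latents(100, 4, random.Random(1))
w, b = train_linear_probe(train)
train_acc = probe_accuracy(w, b, train)
held_out_acc = probe_accuracy(w, b, held_out)
```

A comparison in the spirit of the reported finding would fit the same probe on single-frame features and on temporal latents, then compare held-out accuracy between the two.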
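🎯 Application scenarios
Potential application areas include smart-city planning, robot navigation, and augmented reality. By supplying explicit navigable geometry, the representation could improve how embodied agents navigate complex urban environments and support further work in these areas.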
📄 Abstract (original)
Embodied agents that navigate cities rely on world models that predict how their surroundings will change as they move. But for navigation, what matters is not what the buildings look like; it is where the agent can go. Most world models nonetheless predict appearance, learning how a scene looks rather than the space an agent can move through. Those that do target geometry, such as bird's-eye-view occupancy grids, flatten the three-dimensional environment onto a ground plane, discarding the above-ground and multi-level structure that shapes real navigation. What is missing is a predictive target that captures the navigable geometry an agent actually traverses, without photometric entanglement and without collapsing the third dimension. Our key idea is to model the open volume between buildings, the negative space, encoded as a 3D isovist: a spherical visibility-depth map recording the distance to the nearest surface in every direction. We introduce an embodied world model that predicts the next isovist from a short history of past isovists and a movement action. The prediction is formulated as a depth residual so the decoder inherits sharp building edges, trained with self-rollout scheduled sampling to keep corrupted context on the geometry manifold, and equipped with a persistent latent bird's-eye-view spatial map for cross-path consistency. Our central finding is emergent and unexpected: a single city-blind model trained on Manhattan and Paris develops a cross-city spatial signature, with city identity linearly decodable from its temporal latents far above single-frame baselines, so the signature lives in the learned dynamics rather than in appearance. The representation is lightweight, interpretable, and reproducible, offering a geometric substrate for spatial reasoning in embodied AI, robotics, and urban analysis, released with an open dataset and pipeline.