DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding
Authors: Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li
Categories: cs.CV, cs.RO
Published: 2026-05-28 (updated: 2026-05-29)
Comments: 9 pages, 6 figures
🔗 Code/Project: https://icr-lab.github.io/DGSG-Mind
💡 One-Sentence Summary
DGSG-Mind is proposed to address fragile instance association in dynamic 3D scene understanding.
🎯 Matched Areas: Pillar 3: Spatial Perception & Semantics; Pillar 6: Video Extraction & Matching; Pillar 9: Embodied Foundation Models
Keywords: dynamic scene understanding, 3D Gaussians, instance association, incremental semantic mapping, robotics
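📋 Key Points
- Existing methods for dynamic 3D scene understanding suffer from fragile instance association and have limited ability to handle object-level topological changes.
- DGSG-Mind couples a probabilistic voxel grid with explicit 3D Gaussians to achieve robust instance fusion and incremental semantic mapping.
- In experiments, DGSG-Mind achieves the best zero-shot 3D visual grounding (3DVG) performance among methods operating on self-reconstructed maps, and it has been deployed on real-world robots.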
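📝 Abstract (Translated Summary)
Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association caused by incomplete cross-view cues, and their limited ability to handle object-level topological changes restricts long-term robotic task execution. To address these challenges, the paper proposes DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system that couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. Experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, and that it performs strongly in 3D open-vocabulary semantic segmentation and scene reconstruction.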
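🔬 Method Details
Problem definition: The paper targets two weaknesses in dynamic 3D scene understanding: fragile instance association and limited handling of object-level topological changes. Existing methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry.
Core idea: DGSG-Mind builds a hybrid instance-aware 3D Gaussian dynamic scene graph system. By coupling a probabilistic voxel grid with explicit 3D Gaussians, it makes cross-modal instance fusion more robust and supports incremental semantic mapping.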
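To make the idea of a probabilistic voxel grid concrete, here is a minimal sketch in which each voxel keeps a smoothed categorical belief over instance IDs and fuses repeated observations by accumulating weighted votes. This is a generic illustration, not the paper's update rule, which this summary does not specify; the class name `ProbabilisticVoxelGrid` and the parameters `voxel_size`, `prior`, and `weight` are all hypothetical.

```python
import math
from collections import defaultdict


class ProbabilisticVoxelGrid:
    """Sparse voxel grid holding a categorical belief over instance IDs.

    Hypothetical illustration: the real system's fusion rule is not
    described in this summary.
    """

    def __init__(self, voxel_size=0.05, prior=1.0):
        self.voxel_size = voxel_size
        self.prior = prior  # pseudo-count added to every observed instance ID
        # voxel index -> {instance_id: accumulated vote weight}
        self.counts = defaultdict(lambda: defaultdict(float))

    def key(self, point):
        """Map a 3D point (x, y, z) in metres to its integer voxel index."""
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def integrate(self, point, instance_id, weight=1.0):
        """Fuse one observation: this point was labelled `instance_id`."""
        self.counts[self.key(point)][instance_id] += weight

    def posterior(self, point):
        """Return smoothed probabilities over instance IDs seen in the voxel."""
        votes = self.counts.get(self.key(point))
        if not votes:
            return {}
        total = sum(votes.values()) + self.prior * len(votes)
        return {i: (c + self.prior) / total for i, c in votes.items()}

    def best_instance(self, point):
        """Most probable instance ID for the voxel, or None if unobserved."""
        post = self.posterior(point)
        return max(post, key=post.get) if post else None
```

Accumulating votes rather than overwriting labels means a single inconsistent mask from one viewpoint does not flip a voxel's instance assignment, which is the kind of robustness to incomplete cross-view cues that the summary describes.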
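Technical framework: The overall architecture comprises a probabilistic voxel grid, an instance Gaussian map, and a hierarchical scene graph built on top of that map. The system handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement, and its reasoning component (the 3D Gaussian Mind) integrates structural relations and spatial-semantic information.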
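A hierarchical scene graph of the kind named above can be pictured as a tree of nodes plus a list of relation triples. The sketch below is an assumption-laden illustration only: the levels ("scene", "room", "object"), the field names, and the `describe` text serialization are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    level: str   # hypothetical levels: "scene", "room", "object"
    label: str   # open-vocabulary label, e.g. "mug"
    centroid: tuple = (0.0, 0.0, 0.0)
    children: list = field(default_factory=list)  # child node IDs


class SceneGraph:
    """Tree of nodes plus (subject, predicate, object) relation triples."""

    def __init__(self):
        self.nodes = {}
        self.relations = []

    def add_node(self, node, parent_id=None):
        """Register a node and, if given, attach it under its parent."""
        self.nodes[node.node_id] = node
        if parent_id is not None:
            self.nodes[parent_id].children.append(node.node_id)

    def relate(self, subject_id, predicate, object_id):
        """Record a structural relation such as ("o0", "on", "o1")."""
        self.relations.append((subject_id, predicate, object_id))

    def relations_of(self, node_id):
        """All relation triples in which the node takes part."""
        return [r for r in self.relations if node_id in (r[0], r[2])]

    def describe(self, node_id, depth=0):
        """Serialize a subtree as indented text lines."""
        node = self.nodes[node_id]
        lines = ["  " * depth + f"{node.level}: {node.label} ({node.node_id})"]
        for child_id in node.children:
            lines.extend(self.describe(child_id, depth + 1))
        return lines
```

A text serialization like `describe` is one plausible way to hand structural relations to a multimodal reasoning agent alongside rendered views; how DGSG-Mind actually encodes its graph is not stated in this summary.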
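Key innovation: The main novelty is the hybrid instance-aware design, which handles instance fusion and semantic mapping in dynamic scenes and is intended to adapt to scene changes better than prior methods.
Key design: DGSG-Mind uses Gaussian-based visual relocalization and performs localized masked refinement guided by geometric-semantic consistency, which keeps the semantic map accurate and robust as the scene changes.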
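One way to read "geometric-semantic consistency" is as a per-pixel comparison between what the current map predicts and what the camera observes, with refinement restricted to pixels that disagree. The sketch below assumes flat per-pixel lists of depths and feature vectors and flags a pixel when either the depth residual or the feature similarity fails a threshold. The function names, the OR combination, and the thresholds `depth_tol` and `sim_thresh` are hypothetical choices for illustration, not the paper's criterion.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def changed_mask(rendered_depth, observed_depth,
                 rendered_feat, observed_feat,
                 depth_tol=0.10, sim_thresh=0.7):
    """Flag pixels where the map and the observation disagree.

    Inputs are flat per-pixel lists: depths in metres, features as vectors.
    A pixel is flagged if the depth residual exceeds `depth_tol` OR the
    feature cosine similarity falls below `sim_thresh` (assumed rule).
    """
    mask = []
    for d_r, d_o, f_r, f_o in zip(rendered_depth, observed_depth,
                                  rendered_feat, observed_feat):
        geometric_mismatch = abs(d_r - d_o) > depth_tol
        semantic_mismatch = cosine(f_r, f_o) < sim_thresh
        mask.append(geometric_mismatch or semantic_mismatch)
    return mask
```

Restricting updates to the flagged region is what makes the refinement "localized": unchanged parts of the map are left untouched, so a moved object does not force a full re-optimization.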
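📊 Experimental Highlights
DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps. It also delivers strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction.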
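For context, 3D visual grounding benchmarks commonly report accuracy at an IoU threshold (for example Acc@0.25 and Acc@0.5): a prediction counts as correct when the 3D IoU between the predicted and ground-truth boxes meets the threshold. This summary does not state which metric the paper reports, so the sketch below is only an illustration of that common convention, assuming axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`; the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def box_iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union


def accuracy_at_iou(predictions, ground_truths, threshold=0.25):
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    hits = sum(
        box_iou_3d(p, g) >= threshold
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)
```

When grounding runs on a self-reconstructed map rather than ground-truth geometry, reconstruction error lowers box overlap directly, which is one reason the two settings are usually compared separately.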
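🎯 Application Scenarios
DGSG-Mind targets long-term scene understanding and robotic task execution. Because it supports target-oriented reasoning and dynamic map updates in changing environments, potential application areas include smart homes, autonomous driving, and service robots.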
📄 Abstract (Original)
Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind