DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding

Authors: Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li

Categories: cs.CV, cs.RO

Published: 2026-05-28 (updated: 2026-05-29)

Comments: 9 pages, 6 figures

🔗 Code/Project: https://icr-lab.github.io/DGSG-Mind

💡 One-Sentence Takeaway

Proposes DGSG-Mind to address fragile instance association in dynamic 3D scene understanding.

🎯 Matched Areas: Pillar 3: Spatial Perception & Semantics; Pillar 6: Video Extraction & Matching; Pillar 9: Embodied Foundation Models

Keywords: dynamic scene understanding, 3D Gaussians, instance association, incremental semantic mapping, robotics

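📋 Key Points

Existing methods for dynamic 3D scene understanding suffer from fragile instance association and have limited ability to handle object-level topological changes.
DGSG-Mind couples a probabilistic voxel grid with explicit 3D Gaussians to achieve robust cross-modal instance fusion and incremental semantic mapping.
Experiments show that DGSG-Mind achieves the best zero-shot 3D visual grounding (3DVG) performance among methods operating on self-reconstructed maps, and the system has been deployed on real-world robots.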

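📝 Abstract (Translated Summary)

Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association caused by incomplete cross-view cues, and their limited ability to handle object-level topological changes restricts long-term robotic task execution. To address these challenges, the paper presents DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system that couples a probabilistic voxel grid with explicit 3D Gaussians for robust cross-modal instance fusion and incremental semantic mapping. Experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps and performs strongly in 3D open-vocabulary semantic segmentation and scene reconstruction.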

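🔬 Method Details

Problem definition: The paper targets two weaknesses in dynamic 3D scene understanding: fragile instance association and limited handling of object-level topological changes. Existing methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry.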

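Core idea: DGSG-Mind builds a hybrid instance-aware 3D Gaussian dynamic scene graph system. It couples a probabilistic voxel grid with explicit 3D Gaussians to make cross-modal instance fusion more robust and to support incremental semantic mapping.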
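
The summary does not describe how the probabilistic voxel grid is implemented, so the following is only a minimal sketch of the general idea, assuming each voxel accumulates confidence-weighted votes over instance IDs across views. The class `ProbabilisticVoxelGrid`, its methods, and the parameters `voxel_size` and `min_support` are all hypothetical names chosen for illustration, not the authors' API.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


class ProbabilisticVoxelGrid:
    """Per-voxel evidence over instance IDs (hypothetical structure)."""

    def __init__(self, voxel_size: float = 0.05) -> None:
        self.voxel_size = voxel_size
        # voxel index -> {instance_id: accumulated confidence}
        self.evidence: Dict[Voxel, Dict[int, float]] = defaultdict(
            lambda: defaultdict(float)
        )

    def voxel_of(self, point: Point) -> Voxel:
        x, y, z = (int(math.floor(c / self.voxel_size)) for c in point)
        return (x, y, z)

    def integrate(self, points: List[Point], instance_id: int,
                  confidence: float = 1.0) -> None:
        """Accumulate one view's evidence that `points` belong to an instance."""
        for p in points:
            self.evidence[self.voxel_of(p)][instance_id] += confidence

    def posterior(self, point: Point) -> Dict[int, float]:
        """Normalized instance distribution for the voxel containing `point`."""
        votes = self.evidence.get(self.voxel_of(point))
        if not votes:
            return {}
        total = sum(votes.values())
        return {inst: v / total for inst, v in votes.items()}

    def associate(self, points: List[Point],
                  min_support: float = 0.5) -> Optional[int]:
        """Return the mapped instance best supported by `points`, else None."""
        scores: Dict[int, float] = defaultdict(float)
        for p in points:
            for inst, prob in self.posterior(p).items():
                scores[inst] += prob
        if not scores:
            return None
        best = max(scores, key=lambda inst: scores[inst])
        return best if scores[best] / len(points) >= min_support else None
```

In this sketch, a new detection whose back-projected points mostly fall into voxels dominated by one instance is merged into that instance, and a detection with no sufficiently supported match returns `None` and would start a new instance. Spatial voting of this kind is one way to avoid relying on feature matching alone when cross-view cues are incomplete.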

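Technical framework: The overall architecture comprises a probabilistic voxel grid, an instance Gaussian map, and a hierarchical scene graph built on top of that map. The system handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement. Its reasoning component, the 3D Gaussian Mind, integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning.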
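
To make the notion of a hierarchical scene graph concrete, here is a small illustrative data structure. The levels ("scene", "region", "object"), the field names, and the relation triples are assumptions for this sketch; the summary does not specify the graph schema used by DGSG-Mind.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    node_id: str
    level: str  # assumed levels: "scene", "region", "object"
    label: str
    centroid: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)


@dataclass
class SceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # (subject_id, predicate, object_id), e.g. ("mug", "on", "table")
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node, parent: Optional[str] = None) -> None:
        if parent is not None:
            node.parent = parent
            self.nodes[parent].children.append(node.node_id)
        self.nodes[node.node_id] = node

    def relate(self, subj: str, predicate: str, obj: str) -> None:
        self.relations.append((subj, predicate, obj))

    def path_to_root(self, node_id: str) -> List[str]:
        """IDs from the node up to the root, useful as context for a query."""
        path = [node_id]
        while self.nodes[path[-1]].parent is not None:
            path.append(self.nodes[path[-1]].parent)
        return path

    def remove_subtree(self, node_id: str) -> None:
        """Dynamic update: drop a node, its descendants, and their relations."""
        node = self.nodes[node_id]
        for child in list(node.children):
            self.remove_subtree(child)
        if node.parent is not None and node.parent in self.nodes:
            self.nodes[node.parent].children.remove(node_id)
        self.relations = [
            r for r in self.relations if node_id not in (r[0], r[2])
        ]
        del self.nodes[node_id]
```

A graph like this lets an object-level change (an object removed or moved) be handled by editing one subtree and its relations rather than rebuilding the whole map.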

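Key innovation: The main novelty is the hybrid instance-aware design, which combines instance fusion with incremental semantic mapping in dynamic scenes. Compared with the existing methods described above, it is intended to adapt better to object-level changes over time.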

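Key design: DGSG-Mind uses Gaussian-based visual relocalization and guides localized masked refinement with geometric-semantic consistency, which is meant to keep the semantic map accurate and robust as the scene changes.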
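
The summary does not define the geometric-semantic consistency test. One plausible reading, sketched below under that assumption, compares the depth rendered from the map with the observed depth (geometry) and the stored feature with the observed feature (semantics), then marks disagreeing pixels for refinement. The function names and the thresholds `depth_tol` and `sim_thresh` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_consistent(rendered_depth: float, observed_depth: float,
                  map_feature: Sequence[float],
                  observed_feature: Sequence[float],
                  depth_tol: float = 0.05,
                  sim_thresh: float = 0.7) -> bool:
    """A pixel agrees with the map only if geometry AND semantics agree."""
    geometric_ok = abs(rendered_depth - observed_depth) <= depth_tol
    semantic_ok = cosine_similarity(map_feature, observed_feature) >= sim_thresh
    return geometric_ok and semantic_ok


def refinement_mask(rendered_depths: Sequence[float],
                    observed_depths: Sequence[float],
                    map_features: Sequence[Sequence[float]],
                    observed_features: Sequence[Sequence[float]],
                    depth_tol: float = 0.05,
                    sim_thresh: float = 0.7) -> List[bool]:
    """True marks pixels that disagree with the map and would be refined."""
    return [
        not is_consistent(rd, od, mf, of, depth_tol, sim_thresh)
        for rd, od, mf, of in zip(
            rendered_depths, observed_depths, map_features, observed_features
        )
    ]
```

Restricting updates to the masked pixels is what makes the refinement localized: regions that still agree with the map are left untouched.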

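📊 Experimental Highlights

DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps. It also delivers strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction.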
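
The summary does not state which metric underlies the 3DVG result. 3D visual grounding is commonly scored as the fraction of queries whose predicted box overlaps the reference box above an IoU threshold, so the sketch below shows that computation for axis-aligned boxes. The box layout `(xmin, ymin, zmin, xmax, ymax, zmax)`, the function names, and the default threshold of 0.25 are assumptions for illustration.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    xmin, ymin, zmin, xmax, ymax, zmax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin) * max(0.0, zmax - zmin)


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter_box = (
        max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
        min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]),
    )
    inter = box_volume(inter_box)  # zero when the boxes do not overlap
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0.0 else 0.0


def accuracy_at_iou(predictions: Sequence[Box], references: Sequence[Box],
                    threshold: float = 0.25) -> float:
    """Fraction of queries whose predicted box reaches the IoU threshold."""
    if not predictions:
        return 0.0
    hits = sum(
        1 for p, r in zip(predictions, references) if iou_3d(p, r) >= threshold
    )
    return hits / len(predictions)
```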

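🎯 Application Scenarios

DGSG-Mind has broad potential for long-term scene understanding and robotic task execution. It supports target-oriented reasoning and dynamic map updates in changing environments, which makes it relevant to smart homes, autonomous driving, and service robots.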

📄 Abstract (Original)

Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind
