MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

Authors: Ruiyuan Lyu, Jingli Lin, Tai Wang, Shuai Yang, Xiaohan Mao, Yilun Chen, Runsen Xu, Haifeng Huang, Chenming Zhu, Dahua Lin, Jiangmiao Pang

Categories: cs.CV, cs.AI, cs.RO

Published: 2024-06-13 (updated: 2025-06-09)

Notes: Follow-up of EmbodiedScan (camera-ready version). A multi-modal 3D dataset with the most-ever comprehensive language annotations for 3D-LLMs. Project page: https://tai-wang.github.io/mmscan/

🔗 Code/Project: GitHub

💡 One-sentence summary

MMScan: a multi-modal 3D scene dataset with hierarchical language annotations, built to advance 3D perception research.

🎯 Matched areas: Pillar 7: Motion Retargeting; Pillar 9: Embodied Foundation Models

Keywords: multi-modal 3D perception, 3D scene understanding, vision-language models, dataset construction, visual grounding

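📋 Key points

Existing 3D scene datasets confine prior work to object properties and inter-object spatial relationships, and lack comprehensive multi-modal annotations.
MMScan is built top-down: VLMs initialize the annotations and human correction ensures their quality.
Experiments show that models trained on MMScan improve markedly on 3D visual grounding and LLM tasks.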

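📝 Abstract (translated from Chinese)

With the rise of large language models (LLMs) and their integration with other data modalities, multi-modal 3D perception has attracted growing attention for its connection to the physical world and has progressed rapidly. However, limited by existing datasets, prior work has mainly focused on understanding object properties or inter-object spatial relationships in 3D scenes. To address this, the paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date, with hierarchical grounded language annotations. It is constructed with top-down logic, from region level to object level and from single targets to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline uses powerful vision-language models (VLMs) with carefully designed prompts to initialize annotations efficiently, then brings human correction into the loop to ensure the annotations are natural, correct, and comprehensive. Built on existing 3D scanning data, the resulting dataset contains 1.4M meta-annotated captions on 109k objects and 7.7k regions, plus over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. The authors evaluate representative baselines on these benchmarks, analyze their capabilities in different aspects, and highlight key problems for future work. They also use this dataset to train state-of-the-art 3D visual grounding models and LLMs, obtaining remarkable performance improvements on both existing benchmarks and in-the-wild evaluation. Code, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.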

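🔬 Method details

Problem definition: Existing 3D scene datasets focus mainly on object properties and inter-object relationships, and lack holistic scene understanding and fine-grained language descriptions. This limits the development of multi-modal 3D perception models, especially for tasks that require understanding complex scenes and reasoning about them. Annotating existing datasets is also expensive, which makes it hard to scale to large scenes.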

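Core idea: The paper builds a large-scale, multi-modal 3D scene dataset using an efficient annotation pipeline. Vision-language models (VLMs) generate the initial annotations, and human correction then improves their quality. The approach aims to lower annotation cost while keeping the annotations accurate and comprehensive.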

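Technical framework: The MMScan construction pipeline has the following stages: 1) Data collection: start from existing 3D scanning data. 2) Region segmentation: divide each 3D scene into regions. 3) VLM annotation: use carefully designed prompts to have VLMs generate an initial language description for each region and object. 4) Human correction: annotators review and revise the VLM-generated annotations so that they are natural, correct, and comprehensive. 5) Benchmarking: build 3D visual grounding and question-answering benchmarks and evaluate existing models on them.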
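The draft-then-correct loop in stages 3 and 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Annotation` class, the `annotate` function, and the two stand-in callables are all hypothetical, and a real pipeline would call an actual VLM and an annotation tool in their place.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Annotation:
    target_id: str                 # hypothetical region or object identifier
    text: str                      # language description of the target
    human_verified: bool = False   # set once a reviewer has processed the draft


def annotate(
    target_ids: List[str],
    draft_fn: Callable[[str], str],
    review_fn: Callable[[str, str], str],
) -> List[Annotation]:
    """Draft each description with a model, then pass it through a reviewer."""
    results = []
    for target_id in target_ids:
        draft = draft_fn(target_id)          # stage 3: model-generated draft (stubbed)
        final = review_fn(target_id, draft)  # stage 4: human correction (stubbed)
        results.append(Annotation(target_id, final, human_verified=True))
    return results


# Stand-ins for a real VLM call and a real human review step (assumptions).
def fake_vlm(target_id: str) -> str:
    return f"a description of {target_id}"


def fake_reviewer(target_id: str, draft: str) -> str:
    return draft.capitalize() + "."


annotations = annotate(["chair_1", "region_kitchen"], fake_vlm, fake_reviewer)
```

Passing the model and the reviewer in as callables keeps the loop itself independent of any particular VLM or review interface.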

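Key innovations: MMScan contributes a hierarchical, grounded language annotation scheme and an efficient annotation pipeline. The hierarchical scheme lets models understand a scene at different granularities, from regions to objects and from single targets to inter-target relationships. The pipeline combines VLMs with human correction, lowering annotation cost while preserving annotation quality.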
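One way to picture the region-to-object-to-relationship hierarchy is as nested records. The schema below is purely hypothetical (class names, field names, and the sample captions are assumptions made for illustration) and does not reflect MMScan's actual file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectAnnotation:
    object_id: str
    caption: str      # single-target description (attributes, placement)


@dataclass
class RelationAnnotation:
    subject_id: str
    object_id: str
    relation: str     # inter-target relationship, e.g. "is next to"


@dataclass
class RegionAnnotation:
    region_id: str
    caption: str      # region-level description
    objects: List[ObjectAnnotation] = field(default_factory=list)
    relations: List[RelationAnnotation] = field(default_factory=list)

    def captions_top_down(self) -> List[str]:
        """List captions from coarse to fine: region, objects, relations."""
        out = [self.caption]
        out += [o.caption for o in self.objects]
        out += [f"{r.subject_id} {r.relation} {r.object_id}" for r in self.relations]
        return out


region = RegionAnnotation(
    region_id="region_0",
    caption="A dining area.",
    objects=[
        ObjectAnnotation("table_1", "A wooden table."),
        ObjectAnnotation("chair_1", "A black chair."),
    ],
    relations=[RelationAnnotation("chair_1", "table_1", "is next to")],
)
```

Calling `region.captions_top_down()` walks the hierarchy in the same coarse-to-fine order the text describes.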

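Key design: For the VLM annotation stage, the authors designed a variety of prompts to steer VLMs toward more accurate and comprehensive descriptions. For object attributes, a prompt such as "What is the color/material/shape of this object?" can be used. For inter-object relationships, a prompt such as "What is the spatial relationship between object A and object B?" can be used. The human correction stage focuses on fixing wrong descriptions produced by the VLM and filling in missing information.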
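The two example prompts quoted above can be turned into simple templates. This is only a sketch: the template strings follow the examples in the text, while the constant names, function names, and default attribute list are assumptions, and the prompts actually used by the authors may differ.

```python
from typing import List, Sequence

# Templates modeled on the example prompts quoted in the text.
ATTRIBUTE_PROMPT = "What is the {attribute} of this object?"
RELATION_PROMPT = "What is the spatial relationship between {a} and {b}?"


def attribute_prompts(
    attributes: Sequence[str] = ("color", "material", "shape"),
) -> List[str]:
    """Build one attribute question per requested attribute."""
    return [ATTRIBUTE_PROMPT.format(attribute=name) for name in attributes]


def relation_prompt(a: str, b: str) -> str:
    """Build a spatial-relationship question for a pair of objects."""
    return RELATION_PROMPT.format(a=a, b=b)


prompts = attribute_prompts() + [relation_prompt("object A", "object B")]
```

Keeping the templates as module-level constants makes it easy to review and revise the wording without touching the code that fills them in.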

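📊 Experimental highlights

MMScan contains 1.4M meta-annotated captions on 109k objects and 7.7k regions, plus over 3.04M samples for 3D visual grounding and question-answering benchmarks. Experiments show that models trained on MMScan achieve remarkable improvements on 3D visual grounding and LLM tasks, on both existing benchmarks and in-the-wild evaluation.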

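🎯 Application scenarios

MMScan can be applied to robot navigation, scene understanding, virtual reality, augmented reality, and related areas. High-quality annotations can improve performance on tasks such as 3D visual grounding and visual question answering, and strengthen an agent's ability to interact with the physical world. In the future, the dataset could be used to train stronger multi-modal models for more capable 3D scene understanding and reasoning.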

📄 Abstract (original)

With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception attracts more attention due to its connectivity to the physical world and makes rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the first largest ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan. It is constructed based on a top-down logic, from region to object level, from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline incorporates powerful VLMs via carefully designed prompts to initialize the annotations efficiently and further involve humans' correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding and LLMs and obtain remarkable performance improvement both on existing benchmarks and in-the-wild evaluation. Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.
