Segment Any 3D Object with Language
Authors: Seungjun Lee, Yuyang Zhao, Gim Hee Lee
Categories: cs.CV, cs.AI
Published: 2024-04-02
Note: Project Page: https://cvrp-sole.github.io
💡 One-Sentence Summary
Proposes SOLE to address open-vocabulary 3D instance segmentation
🎯 Matched Areas: Pillar 3: Spatial Perception & Semantics; Pillar 9: Embodied Foundation Models
Keywords: open vocabulary, 3D instance segmentation, multimodal fusion, semantic masks, robot vision
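📋 Key Points
- Existing methods generalize poorly to unseen novel categories, which limits their performance.
- The SOLE framework generates semantic-related masks directly from 3D point clouds, improving generalization.
- On ScanNetv2, ScanNet200, and Replica, SOLE outperforms previous methods by a large margin and comes close to its fully supervised counterpart.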
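📝 Abstract (translated from Chinese)
This paper studies Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier methods that train only on annotated base categories generalize poorly to unseen novel categories. Recent work mitigates this by generating class-agnostic masks or by projecting 2D masks into 3D, but disregards semantic or geometric information, which leads to sub-optimal performance. The proposed SOLE framework instead generates semantic-related masks directly from 3D point clouds. It uses a multimodal fusion network and outperforms previous methods on the ScanNetv2, ScanNet200, and Replica benchmarks, with results close to the fully supervised counterpart, and it adapts well to varied language instructions.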
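🔬 Method Details
Problem definition: The paper targets the limited generalization of open-vocabulary 3D instance segmentation. Existing methods mostly rely on annotated base categories and cannot handle unseen novel categories well, so performance drops.
Core idea: SOLE generates semantic-related masks directly from 3D point clouds and incorporates multimodal semantic information, improving both generalization and segmentation quality.
Technical framework: SOLE is built around a multimodal fusion network that incorporates multimodal semantics in both the backbone and the decoder. The backbone extracts features from the 3D point cloud, and the decoder produces the final semantic-related masks.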
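To make the backbone/decoder split concrete, here is a minimal sketch of how a decoder could turn per-point backbone features into instance masks. It assumes a query-based design in which each mask is the thresholded dot product between a query vector and the point features. The summary does not confirm that SOLE works this way; `predict_masks`, the feature dimension, and the threshold are all hypothetical.

```python
import numpy as np

def predict_masks(point_feats, queries, threshold=0.5):
    """Hypothetical mask head: one binary mask per query.

    point_feats: (N, D) per-point features from a backbone.
    queries:     (Q, D) instance query vectors from a decoder.
    Returns a boolean array of shape (Q, N).
    """
    logits = queries @ point_feats.T        # (Q, N) query-point similarity
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid gives per-point mask probability
    return probs > threshold

# Random stand-ins for real backbone features and decoder queries.
rng = np.random.default_rng(0)
point_feats = rng.normal(size=(1000, 32))
queries = rng.normal(size=(5, 32))

masks = predict_masks(point_feats, queries)
print(masks.shape)  # (5, 1000)
```

Each row of `masks` marks the points assigned to one candidate instance. A real system would learn both the features and the queries instead of sampling them.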
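Key innovation: SOLE introduces three types of multimodal associations as supervision. These align the 3D segmentation model with various language instructions and improve mask quality.
Key design: Training uses dedicated loss functions to optimize mask quality, and a multimodal fusion strategy integrates semantic information from different sources.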
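Once masks and language share an embedding space, open-vocabulary labeling can be sketched as a nearest-text lookup. The example below assumes that each predicted mask has an embedding and that each candidate category name has a text embedding of the same dimension. Cosine-similarity matching is a common choice for this step, not something the summary confirms for SOLE, and `classify_masks` and the toy vectors are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def classify_masks(mask_embeds, text_embeds, labels):
    """Assign each mask the label whose text embedding is most similar.

    mask_embeds: (M, D) one embedding per predicted mask.
    text_embeds: (C, D) one embedding per candidate category name.
    labels:      list of C category names.
    """
    sims = l2_normalize(mask_embeds) @ l2_normalize(text_embeds).T  # (M, C)
    best = sims.argmax(axis=1)
    return [labels[i] for i in best], sims

# Toy 3-dimensional embeddings; a real system would use a text encoder.
labels = ["chair", "table", "sofa"]
text_embeds = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
mask_embeds = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.2, 0.8]])

predicted, sims = classify_masks(mask_embeds, text_embeds, labels)
print(predicted)  # ['chair', 'sofa']
```

Because the label set is just a list of embedded strings, new categories can be added at inference time without retraining, which is the property that makes the setting open-vocabulary.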
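📊 Experimental Highlights
SOLE outperforms previous methods on the ScanNetv2, ScanNet200, and Replica benchmarks. Even without class annotations during training, its results come close to the fully supervised counterpart, which indicates strong generalization and adaptability.
🎯 Application Scenarios
Potential application areas include robot vision, autonomous driving, and augmented reality, where 3D objects must be recognized and segmented in complex environments. Combined with natural-language instructions, the approach could support more intelligent human-machine interaction and make automated systems more flexible and adaptable.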
📄 Abstract (Original)
In this paper, we investigate Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier works that rely on only annotated base categories for training suffer from limited generalization to unseen novel categories. Recent works mitigate poor generalizability to novel categories by generating class-agnostic masks or projecting generalized masks from 2D to 3D, but disregard semantic or geometry information, leading to sub-optimal performance. Instead, generating generalizable but semantic-related masks directly from 3D point clouds would result in superior outcomes. In this paper, we introduce Segment any 3D Object with LanguagE (SOLE), which is a semantic and geometric-aware visual-language learning framework with strong generalizability by generating semantic-related masks directly from 3D point clouds. Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder. In addition, to align the 3D segmentation model with various language instructions and enhance the mask quality, we introduce three types of multimodal associations as supervision. Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks, and the results are even close to the fully-supervised counterpart despite the absence of class annotations in the training. Furthermore, extensive qualitative results demonstrate the versatility of our SOLE to language instructions.