Direction-aware 3D Large Multimodal Models
Authors: Quan Liu, Weihao Xuan, Junjue Wang, Naoto Yokoya, Ling Shao, Shijian Lu
Category: cs.CV
Published: 2026-02-22
Comments: In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
💡 One-sentence takeaway
Proposes direction-aware 3D large multimodal models that address the missing-ego-pose problem in point cloud benchmarks.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: direction awareness, 3D multimodal models, ego pose, spatial reasoning, point clouds, PoseRecover, PoseAlign
📋 Key points
- Existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, which limits the performance of 3D large multimodal models (3D LMMs) on directional queries.
- This work proposes two modules, PoseRecover and PoseAlign, which automatically identify ego poses and align the point cloud data to them, strengthening the models' direction awareness.
- Experiments show significant gains on multiple benchmarks, e.g., ScanRefer mIoU improves by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%.
📝 Abstract (summary)
3D large multimodal models (3D LMMs) rely heavily on ego poses for directional question-answering and spatial reasoning. However, existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making 3D large multimodal modelling inherently ill-posed. This work redefines a rigorous new paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses in point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We propose two novel designs, PoseRecover and PoseAlign, and experiments show that they yield consistent improvements across multiple 3D LMM backbones.
🔬 Method details
Problem definition: Existing 3D large multimodal models must answer directional queries without access to ego poses, which undermines their spatial reasoning.
Core idea: Introduce the PoseRecover and PoseAlign modules to automatically identify and supplement ego poses, then align the point cloud data to them, enhancing the model's direction awareness.
Technical framework: The overall architecture consists of two main modules: PoseRecover automatically recovers ego poses from the extrinsics of RGB-D videos, and PoseAlign transforms the point cloud data so that it is aligned with the identified ego poses.
Key innovations: PoseRecover achieves automatic ego-pose recovery via object-frustum intersection and Z-buffer visibility checks, while PoseAlign avoids the complexity of injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers.
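The frustum-intersection and Z-buffer test described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the intrinsics layout, and the 0.05 m depth tolerance are illustrative assumptions.

```python
import numpy as np

def world_to_cam(points_world, extrinsic):
    """Map Nx3 world points through a 4x4 world-to-camera extrinsic."""
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    return (extrinsic @ pts_h.T).T[:, :3]

def visible_fraction(points_world, extrinsic, K, depth_map, tol=0.05):
    """Fraction of an object's points that fall inside the camera frustum
    and pass a Z-buffer test against the frame's depth map."""
    pts = world_to_cam(points_world, extrinsic)
    z = pts[:, 2]
    h, w = depth_map.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    front = z > 1e-6                  # only points in front of the camera
    zs = np.where(front, z, 1.0)      # avoid division by zero for the rest
    u = (fx * pts[:, 0] / zs + cx).astype(int)
    v = (fy * pts[:, 1] / zs + cy).astype(int)
    inside = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    visible = np.zeros(len(pts), dtype=bool)
    idx = np.flatnonzero(inside)
    # A point is visible if it is no deeper than the recorded depth (+tol).
    visible[idx] = z[idx] <= depth_map[v[idx], u[idx]] + tol
    return visible.mean()
```

A frame whose camera sees a large enough fraction of the queried object could then be treated as a candidate ego pose for that question.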
Key designs: PoseRecover matches questions against ego poses drawn from RGB-D video extrinsics, keeping recovery efficient and accurate; PoseAlign applies a geometric transformation to the point cloud so that it is aligned with the ego pose, improving overall performance.
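The geometric alignment itself reduces to an ego-centric change of coordinates. A minimal sketch, assuming a yaw-only ego pose and the convention that the ego's facing direction becomes the +x axis (the exact axis convention is not stated in this summary):

```python
import numpy as np

def align_to_ego(points, ego_position, ego_yaw):
    """Re-express an Nx3 point cloud in an ego-centric frame: translate so
    the ego pose sits at the origin, then rotate by -yaw so the direction
    the ego faces, (cos yaw, sin yaw, 0), becomes the +x axis."""
    c, s = np.cos(-ego_yaw), np.sin(-ego_yaw)
    rot_z = np.array([[c, -s, 0.0],
                      [s,  c, 0.0],
                      [0.0, 0.0, 1.0]])
    return (rot_z @ (points - ego_position).T).T
```

In the aligned frame, directional phrases such as "to the left of" map to fixed coordinate relations, so the model no longer needs the pose spelled out in the prompt.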
📊 Experimental highlights
The proposed method yields consistent gains across multiple 3D LMM backbones, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%, demonstrating both its effectiveness and its broad applicability.
🎯 Application scenarios
Potential applications include augmented reality, robot navigation, and intelligent surveillance, where the method can improve spatial understanding and interaction in complex environments. Looking ahead, it may further advance 3D visual understanding and multimodal interaction.
📄 Abstract (original)
3D large multimodal models (3D LMMs) rely heavily on ego poses for enabling directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed in 3D large multimodal modelling. In this work, we redefine a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses into point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and visibility check with Z-buffers. The second is PoseAlign that transforms the point cloud data to be aligned with the identified ego poses instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D-LMMs.