NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

📄 arXiv: 2404.01300v3

Authors: Muhammad Zubair Irshad, Sergey Zakharov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus

Categories: cs.CV, cs.AI, cs.LG

Published: 2024-04-01 (updated: 2024-07-18)

Comments: Accepted to ECCV 2024. Project Page: https://nerf-mae.github.io/


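💡 One-Sentence Summary

Proposes NeRF-MAE, a masked-autoencoder pretraining method for self-supervised 3D representation learning on NeRFs.

🎯 Matched Areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 3: Spatial Perception and Semantics (Perception & Semantics)

Keywords: self-supervised learning, 3D representation learning, neural radiance fields, masked autoencoders, vision transformers, object detection, computer vision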

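📋 Key Points

  1. Existing self-supervised 3D representation learning methods struggle with implicit representations such as NeRF, from which semantic information is hard to extract.
  2. NeRF-MAE extracts an explicit representation from NeRF's radiance and density grid and applies a masked autoencoder to it, learning the semantic and spatial structure of the scene.
  3. After large-scale pretraining, NeRF-MAE performs well on several 3D tasks, with the largest reported gains on 3D object detection.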

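📝 Abstract (Summary)

Neural fields excel in computer vision and robotics because they can understand the 3D visual world, including inferring semantics, geometry, and dynamics. This paper proposes NeRF-MAE, a self-supervised pretraining method that uses masked autoencoders to learn effective 3D representations from posed RGB images. NeRF's volumetric grid serves as a dense input to a standard 3D Vision Transformer, which lets the model learn the semantic and spatial structure of complete scenes. The representation is pretrained at scale on over 1.8 million images. On the Front3D and ScanNet datasets, NeRF-MAE improves 3D object detection by over 20% AP50 and 8% AP25 in absolute terms.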

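🔬 Method Details

Problem definition: The paper asks how masked autoencoders can be used effectively for self-supervised 3D representation learning. Masked autoencoders are difficult to apply to an implicit representation such as NeRF, so existing methods have trouble extracting stable semantic information from it, which limits performance.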

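Core idea: Random patches of NeRF's radiance and density grid are masked, and a standard 3D Swin Transformer reconstructs them, so the model learns the semantic and spatial structure of complete scenes. Because the volumetric grid is a dense, regular input, the model can capture information more evenly than with representations such as point clouds, where information density can be uneven.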
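The masking step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the 4-channel layout (RGB plus density), the patch size, the 75% masking ratio, and the function name `mask_random_patches` are all assumptions made for the example.

```python
import numpy as np

def mask_random_patches(grid, patch=4, mask_ratio=0.75, seed=0):
    """Zero out a random subset of non-overlapping 3D patches.

    grid: array of shape (C, D, H, W), e.g. C=4 for RGB + density (assumed layout).
    Returns the masked grid and a boolean patch mask of shape
    (D/patch, H/patch, W/patch), where True marks a masked patch.
    """
    c, d, h, w = grid.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    pd, ph, pw = d // patch, h // patch, w // patch
    n_patches = pd * ph * pw
    n_masked = int(round(mask_ratio * n_patches))

    # Pick which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    patch_mask = flat.reshape(pd, ph, pw)

    # Expand the patch mask to voxel resolution and blank the hidden voxels.
    voxel_mask = patch_mask.repeat(patch, 0).repeat(patch, 1).repeat(patch, 2)
    masked = np.where(voxel_mask[None], 0.0, grid)
    return masked, patch_mask

# Toy 16^3 grid with 4 channels (hypothetical data).
grid = np.random.default_rng(1).random((4, 16, 16, 16))
masked, patch_mask = mask_random_patches(grid)
```

In a masked-autoencoder setup, the encoder would see only the visible patches and the decoder would be trained to predict the hidden ones; the sketch covers only the mask selection.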

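Technical framework: The pipeline consists of data collection, masking, reconstruction with a 3D transformer, and pretraining. First, posed RGB images are used to obtain a NeRF and its volumetric grid; next, random patches of the grid are masked; finally, a 3D Swin Transformer reconstructs the masked patches. Once pretrained, the encoder is reused for 3D transfer learning.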

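Key innovation: The main innovation is applying masked autoencoders to an explicit representation extracted from NeRF. Using the camera trajectory to sample this representation canonicalizes scenes across domains and sidesteps the difficulty of masking an implicit representation, which makes self-supervised pretraining more effective.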
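One way to picture the extraction of an explicit grid from an implicit field is sketched below. The summary does not describe the exact procedure, so everything here is an assumption for illustration: the sampling volume is taken to be the padded bounding box of the camera positions, `toy_field` is a hypothetical stand-in for a trained NeRF, and `extract_grid` is not a function from the paper's code.

```python
import numpy as np

def toy_field(points):
    """Hypothetical stand-in for a trained NeRF: (N, 3) points -> (N, 4) RGB + density."""
    rgb = 0.5 * (np.sin(points) + 1.0)                            # colors in [0, 1]
    sigma = np.exp(-np.sum(points ** 2, axis=1, keepdims=True))   # density peaks at origin
    return np.concatenate([rgb, sigma], axis=1)

def extract_grid(field, cam_positions, resolution=16, pad=0.1):
    """Query `field` on a regular grid spanning the padded bounding box of the cameras."""
    lo = cam_positions.min(axis=0) - pad
    hi = cam_positions.max(axis=0) + pad
    axes = [np.linspace(lo[i], hi[i], resolution) for i in range(3)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    points = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    values = field(points)                                        # (resolution^3, 4)
    # Reorder to channels-first: (4, R, R, R).
    return values.reshape(resolution, resolution, resolution, 4).transpose(3, 0, 1, 2)

# Three hypothetical camera centers along a trajectory.
cams = np.array([[-1.0, 0.0, 1.5], [1.0, 0.5, 1.5], [0.0, -1.0, 1.2]])
grid = extract_grid(toy_field, cams)
```

The result is a dense, regular array that a 3D transformer can split into patches, which is what makes masking straightforward compared with querying the implicit field directly.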

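Key design: A standard 3D Swin Transformer serves as the reconstruction module. The masking ratio and the reconstruction loss are chosen so that the model learns the semantic information of the scene.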
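The loss is not specified in this summary. As a hedged illustration, the sketch below follows the common masked-autoencoder convention of computing mean squared error only over masked voxels; whether NeRF-MAE uses exactly this form is an assumption, and `masked_mse` is a hypothetical helper name.

```python
import numpy as np

def masked_mse(pred, target, voxel_mask):
    """Mean squared error over masked voxels only (assumed loss form).

    pred, target: arrays of shape (C, D, H, W).
    voxel_mask: boolean array of shape (D, H, W), True = masked voxel.
    """
    diff = (pred - target) ** 2
    selected = diff[:, voxel_mask]      # shape (C, n_masked_voxels)
    return float(selected.mean())

# Toy example: one masked voxel with error 2 in every channel.
target = np.zeros((4, 2, 2, 2))
pred = target.copy()
mask = np.zeros((2, 2, 2), dtype=bool)
mask[0, 0, 0] = True
pred[:, 0, 0, 0] = 2.0   # error on the masked voxel counts toward the loss
pred[:, 1, 1, 1] = 5.0   # error on a visible voxel is ignored
loss = masked_mse(pred, target, mask)
```

Restricting the loss to masked voxels means the model is scored on what it had to infer rather than on what it could copy from the visible input.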

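📊 Experimental Highlights

On the Front3D and ScanNet datasets, NeRF-MAE outperforms self-supervised 3D pretraining and NeRF scene-understanding baselines on 3D object detection, with absolute improvements of over 20% AP50 and 8% AP25.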
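AP50 and AP25 are average precision computed at 3D intersection-over-union (IoU) thresholds of 0.5 and 0.25: a predicted box counts as a match only if its IoU with a ground-truth box reaches the threshold. The sketch below shows the IoU test for axis-aligned boxes, which is a simplifying assumption; the box format and the function name `iou_3d` are illustrative, and the full AP computation (ranking by confidence, precision-recall integration) is omitted.

```python
def iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0          # boxes do not overlap along this axis
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)

# Hypothetical boxes: the prediction is shifted by half the box width along x.
gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
pred = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
iou = iou_3d(gt, pred)          # intersection 4, union 12
matches_ap25 = iou >= 0.25      # True
matches_ap50 = iou >= 0.5       # False
```

The same prediction can therefore count as correct under AP25 and incorrect under AP50, which is why the two metrics can improve by different amounts.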

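🎯 Application Scenarios

Potential applications include 3D scene understanding tasks in autonomous driving, augmented reality, and virtual reality. Better 3D representations from NeRF-MAE could give these applications more accurate environment understanding and object detection.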

📄 Abstract (Original)

Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: Can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images. Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as pointclouds where the information density can be uneven, and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation, such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on Front3D and ScanNet datasets with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection.