CleverDistiller: Simple and Spatially Consistent Cross-modal Distillation

Authors: Hariprasath Govindarajan, Maciej K. Wozniak, Marvin Klingner, Camille Maurice, B Ravi Kiran, Senthil Yogamani

Categories: cs.CV, cs.AI, cs.RO

Published: 2025-03-12 (updated: 2025-11-21)

Comments: Accepted to BMVC 2025

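💡 One-Sentence Summary

CleverDistiller: a simple, spatially consistent cross-modal knowledge distillation method that improves 3D perception performance.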

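🎯 Matched areas: Pillar 2: RL Algorithms & Architecture · Pillar 3: Spatial Perception & Semantics · Pillar 9: Embodied Foundation Models

Keywords: cross-modal learning, knowledge distillation, 3D perception, autonomous driving, feature alignment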

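📋 Key Points

Existing cross-modal knowledge distillation methods rely on complex losses or pseudo-semantic maps, or restrict distillation to features useful only for semantic segmentation, which limits the performance of the resulting 3D models.
CleverDistiller uses a direct feature similarity loss with an MLP projection head and needs no pseudo-semantic maps, enabling efficient 2D-to-3D knowledge transfer.
An auxiliary occupancy prediction task strengthens 3D spatial reasoning; experiments report state-of-the-art results on autonomous driving benchmarks for semantic segmentation and 3D object detection.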

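📝 Abstract (Chinese version, translated)

This paper proposes CleverDistiller, a self-supervised cross-modal 2D-to-3D knowledge distillation framework. Through a set of simple yet effective design choices, it transfers the generalization capability of vision foundation models (VFMs) into 3D LiDAR-based models. Unlike contrastive methods that depend on complex loss designs, CleverDistiller uses a direct feature similarity loss combined with a multi-layer perceptron (MLP) projection head, which lets the 3D network learn complex semantic dependencies throughout the projection. Importantly, the method does not rely on pseudo-semantic maps: knowledge is transferred directly from the VFM without explicit semantic supervision. In addition, the paper introduces occupancy prediction as an auxiliary self-supervised spatial task, complementing the semantic knowledge distilled from the VFM with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks show that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD), with gains of up to 10% mIoU, especially when fine-tuning on very small amounts of data, demonstrating the effectiveness of this simple yet powerful distillation strategy.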

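🔬 Method Details

Problem definition: When transferring knowledge from 2D vision foundation models (VFMs) to 3D LiDAR models, existing cross-modal distillation methods depend on complex contrastive loss functions, require generated pseudo-semantic labels, or focus only on features relevant to semantic segmentation. These limitations reduce the efficiency and generality of the transfer, so the 3D model cannot fully exploit the strong feature representations of 2D vision models.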

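Core idea: CleverDistiller transfers knowledge from 2D visual features to 3D LiDAR features using a simple, direct feature similarity loss combined with an MLP projection head. This avoids contrastive learning and pseudo-label generation and simplifies the distillation pipeline. In addition, occupancy prediction is introduced as an auxiliary task to strengthen the 3D model's understanding of spatial structure.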
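A minimal sketch of the distillation objective described above, assuming NumPy, a two-layer MLP head (Linear, ReLU, Linear) attached to the 3D network, and a cosine-based similarity loss. The source only says "direct feature similarity loss"; the cosine form, the layer sizes, and the names `mlp_head` and `cosine_distill_loss` are illustrative assumptions. Random arrays stand in for the 3D backbone output and for the (assumed frozen) VFM features at paired pixels.

```python
import numpy as np


def mlp_head(x, w1, b1, w2, b2):
    """Hypothetical two-layer projection head: Linear -> ReLU -> Linear."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


def cosine_distill_loss(student, teacher, eps=1e-8):
    """Mean of (1 - cosine similarity) over paired rows; 0 means perfectly aligned."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))


rng = np.random.default_rng(0)
n_points, d_3d, d_hidden, d_vfm = 128, 64, 256, 384  # illustrative sizes only

point_feats = rng.normal(size=(n_points, d_3d))  # stand-in for 3D backbone features
vfm_feats = rng.normal(size=(n_points, d_vfm))   # stand-in for VFM features at paired pixels

w1 = rng.normal(scale=0.1, size=(d_3d, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(scale=0.1, size=(d_hidden, d_vfm))
b2 = np.zeros(d_vfm)

# The head lives on the 3D side: 3D features are mapped into the VFM feature space.
projected = mlp_head(point_feats, w1, b1, w2, b2)
loss = cosine_distill_loss(projected, vfm_feats)
print(projected.shape, round(loss, 3))
```

A cosine loss ignores feature magnitude. An L2 loss on L2-normalized features is an equivalent alternative up to a factor of two, since for unit vectors ‖s − t‖² = 2 − 2 s·t.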

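Technical framework: CleverDistiller consists of the following modules: 1) a 2D vision foundation model (VFM) that extracts feature representations from camera images; 2) a 3D LiDAR model that takes the point cloud as input and produces the corresponding 3D features; 3) an MLP projection head on top of the 3D network that maps its features into the VFM feature space so the two can be aligned; 4) a feature similarity loss that measures the similarity between the projected 3D features and the corresponding 2D VFM features and minimizes their difference; 5) an occupancy prediction module that predicts whether each voxel in 3D space is occupied, serving as an auxiliary task.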
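Aligning 3D features with 2D VFM features requires pairing LiDAR points with image locations. The summary does not describe this step; the sketch below assumes a standard pinhole projection with a known LiDAR-to-camera transform `T_cam_from_lidar` and intrinsics `K`, plus a ViT-style patch feature map sampled at the nearest patch. The function names, the 16-pixel patch size, and the toy calibration are all hypothetical.

```python
import numpy as np


def project_points(points_lidar, T_cam_from_lidar, K, img_h, img_w):
    """Project (N, 3) LiDAR points to pixels; return (N, 2) uv and a validity mask."""
    n = points_lidar.shape[0]
    homog = np.hstack([points_lidar, np.ones((n, 1))])  # (N, 4) homogeneous coordinates
    cam = (T_cam_from_lidar @ homog.T).T[:, :3]          # (N, 3) points in the camera frame
    in_front = cam[:, 2] > 1e-3                          # discard points behind the camera
    depth = np.where(in_front, cam[:, 2], 1.0)           # placeholder depth avoids division by zero
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / depth[:, None]
    valid = (in_front & (uv[:, 0] >= 0) & (uv[:, 0] < img_w)
             & (uv[:, 1] >= 0) & (uv[:, 1] < img_h))
    return uv, valid


def gather_patch_features(uv, valid, feat_map, patch):
    """Nearest-patch lookup in a (H // patch, W // patch, C) feature map for valid points."""
    cols = (uv[valid, 0] // patch).astype(int)
    rows = (uv[valid, 1] // patch).astype(int)
    return feat_map[rows, cols]


# Toy setup: LiDAR frame coincides with the camera frame (z forward), 640x480 image.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
pts = np.array([[0.0, 0.0, 10.0], [1.0, 0.5, 5.0], [0.0, 0.0, -3.0], [100.0, 0.0, 1.0]])
feat_map = np.arange(30 * 40 * 8, dtype=float).reshape(30, 40, 8)  # 16-px patches, C = 8

uv, valid = project_points(pts, T, K, img_h=480, img_w=640)
paired = gather_patch_features(uv, valid, feat_map, patch=16)
print(valid, paired.shape)
```

Only the first two points land inside the image; the third is behind the camera and the fourth projects far outside the frame, so each valid point receives one teacher feature vector to distill from.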

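Key innovation: CleverDistiller's main contribution is its simplicity. Instead of contrastive learning and pseudo-label generation, it transfers knowledge through a direct feature similarity loss, and it adds occupancy prediction as an auxiliary task to improve the 3D model's spatial understanding. This simple design yields significant gains on cross-modal knowledge distillation tasks.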

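Key design choices: 1) Direct feature similarity loss: a loss such as an L2 or cosine-similarity loss measures how close the projected 3D features are to the corresponding 2D VFM features. 2) MLP projection head: a multi-layer perceptron maps the 3D network's features into the VFM feature space for alignment. 3) Occupancy prediction module: 3D space is divided into voxels, the occupancy state of each voxel is predicted, and the module is trained with a cross-entropy loss.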
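The occupancy auxiliary task can be sketched as follows, assuming NumPy, a fixed axis-aligned voxel grid, targets that mark a voxel occupied when at least one LiDAR point falls into it, and a binary cross-entropy on raw logits (the binary form of the cross-entropy mentioned above). The grid extent, voxel size, the weight `lam` in `total_loss`, and all function names are hypothetical; the source does not specify how the targets are built or how the two losses are weighted.

```python
import numpy as np


def occupancy_targets(points, grid_min, voxel_size, grid_shape):
    """Binary occupancy grid: 1.0 for every voxel containing at least one point."""
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    occ = np.zeros(grid_shape)
    occ[tuple(idx[inside].T)] = 1.0
    return occ


def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy on raw logits, averaged over voxels."""
    return float(np.mean(np.maximum(logits, 0.0) - logits * targets
                         + np.log1p(np.exp(-np.abs(logits)))))


def total_loss(kd_loss, occ_loss, lam=1.0):
    """Hypothetical weighted sum of the distillation and occupancy objectives."""
    return kd_loss + lam * occ_loss


points = np.array([[0.1, 0.1, 0.1], [0.15, 0.12, 0.05], [1.6, 0.2, 0.3], [5.0, 5.0, 5.0]])
occ = occupancy_targets(points, grid_min=np.zeros(3), voxel_size=0.5, grid_shape=(4, 4, 4))

logits = np.random.default_rng(0).normal(size=occ.shape)  # stand-in for an occupancy head
occ_loss = bce_with_logits(logits, occ)
print(int(occ.sum()), round(occ_loss, 3), round(total_loss(0.4, occ_loss, lam=0.5), 3))
```

The first two points share a voxel and the last one lies outside the grid, so only two voxels are marked occupied. The logits form of the loss avoids computing `log(sigmoid(x))` directly, which overflows for large-magnitude logits.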

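📊 Experimental Highlights

CleverDistiller delivers significant performance gains on autonomous driving datasets. On semantic segmentation, mIoU improves by up to 10%, and on 3D object detection it also reaches state-of-the-art results. The advantage is largest when fine-tuning on small amounts of labeled data, suggesting that the distilled representations transfer well.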
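Since the headline result is reported in mIoU, here is how that metric is conventionally computed for semantic segmentation: per-class intersection over union from a confusion matrix, averaged over classes. This is a generic NumPy sketch (the name `mean_iou` and the convention of skipping classes absent from both prediction and ground truth are assumptions), not the paper's evaluation code.

```python
import numpy as np


def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    pred = np.asarray(pred, dtype=int)
    gt = np.asarray(gt, dtype=int)
    # confusion[i, j] counts points with ground-truth class i predicted as class j.
    confusion = np.bincount(gt * num_classes + pred,
                            minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - tp
    present = union > 0
    return float(np.mean(tp[present] / union[present]))


gt = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, gt, num_classes=3))  # per-class IoU 1/3, 2/3, 1/2 -> 0.5
```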

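🎯 Application Scenarios

CleverDistiller is applicable to autonomous driving, robot navigation, and 3D scene understanding. Transferring knowledge from 2D vision models to 3D models can improve the accuracy and robustness of 3D perception and thereby the safety of autonomous driving systems. The method can also help robots better understand their surroundings for autonomous navigation. For 3D scene understanding, it could support building more accurate 3D maps for applications such as virtual and augmented reality.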

📄 Abstract (Original)

Vision foundation models (VFMs) such as DINO have led to a paradigm shift in 2D camera-based perception towards extracting generalized features to support many downstream tasks. Recent works introduce self-supervised cross-modal knowledge distillation (KD) as a way to transfer these powerful generalization capabilities into 3D LiDAR-based models. However, they either rely on highly complex distillation losses, pseudo-semantic maps, or limit KD to features useful for semantic segmentation only. In this work, we propose CleverDistiller, a self-supervised, cross-modal 2D-to-3D KD framework introducing a set of simple yet effective design choices: Unlike contrastive approaches relying on complex loss design choices, our method employs a direct feature similarity loss in combination with a multi-layer perceptron (MLP) projection head to allow the 3D network to learn complex semantic dependencies throughout the projection. Crucially, our approach does not depend on pseudo-semantic maps, allowing for direct knowledge transfer from a VFM without explicit semantic supervision. Additionally, we introduce the auxiliary self-supervised spatial task of occupancy prediction to enhance the semantic knowledge, obtained from a VFM through KD, with 3D spatial reasoning capabilities. Experiments on standard autonomous driving benchmarks for 2D-to-3D KD demonstrate that CleverDistiller achieves state-of-the-art performance in both semantic segmentation and 3D object detection (3DOD) by up to 10% mIoU, especially when fine-tuning on really low data amounts, showing the effectiveness of our simple yet powerful KD strategy.
