Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection

Authors: Xiangyu Gao, Yu Dai, Benliu Qiu, Lanxiao Wang, Heqian Qiu, Hongliang Li

Categories: cs.CV

Published: 2025-01-28 (updated: 2025-03-06)

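💡 One-Sentence Summary

Proposes VMCNet to address insufficient detection representations in open-vocabulary object detection.

🎯 Matched area: Pillar 3: Spatial Perception and Semantics (Perception & Semantics)

Keywords: open-vocabulary detection, convolutional neural network, vision-language model, multi-scale features, object detection, deep learning, feature modulation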

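📋 Key Points

Most existing open-vocabulary object detectors rely on frozen pre-trained models and therefore cannot fully exploit labeled data to improve detection performance.
The paper proposes VMCNet, a two-branch network that combines a trainable CNN branch with a frozen ViT branch to strengthen the representations used for detection.
With ViT-B/16, VMCNet reaches 44.3 AP$_{50}^{\text{novel}}$ on OV-COCO and 27.8 mAP$_{r}$ on OV-LVIS, outperforming prior methods on novel-category detection.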

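📝 Abstract (Translated Summary)

Thanks to large-scale image-text contrastive training, pre-trained vision-language models (VLMs) such as CLIP show strong open-vocabulary recognition ability. Most existing open-vocabulary object detectors try to use pre-trained VLMs to obtain generalized representations. F-ViT uses the pre-trained visual encoder as its backbone and freezes it during training, so the backbone cannot use labeled data to strengthen its representations for detection. To address this, the paper proposes a new two-branch backbone, VMCNet, which consists of a trainable convolutional branch, a frozen pre-trained ViT branch, and a VMC module. The trainable CNN branch can be optimized with labeled data, while the frozen ViT branch retains the representation ability obtained from large-scale pre-training. The VMC module modulates the multi-scale CNN features with the representations from the ViT branch. With this mixed structure, the detector is more likely to discover objects of novel categories. Experiments show that the method outperforms existing state-of-the-art methods on both the OV-COCO and OV-LVIS benchmarks.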
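To make "open-vocabulary recognition" concrete: CLIP-style detectors commonly classify a region by comparing its visual embedding with text embeddings of category names, so a category without box annotations can still be scored. The sketch below shows only that general idea in plain Python. The three-dimensional embeddings are made-up numbers, and `classify_region` is a hypothetical helper, not code from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_region(region_embedding, text_embeddings):
    """Hypothetical helper: pick the category whose text embedding is closest."""
    scores = {
        name: cosine_similarity(region_embedding, embedding)
        for name, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up toy embeddings; a real VLM produces much higher-dimensional vectors.
text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "umbrella": [0.0, 0.1, 0.9],  # stands in for a novel category
}
region_embedding = [0.05, 0.2, 0.8]

label, scores = classify_region(region_embedding, text_embeddings)
print(label, {name: round(score, 3) for name, score in scores.items()})
```

Because the category list is just a dictionary of text embeddings, adding a new category in this toy setup only requires adding another entry, which is the property that open-vocabulary detectors rely on.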

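🔬 Method Details

Problem definition: The paper targets open-vocabulary object detection, where existing methods freeze the backbone and therefore cannot use labeled data to strengthen their representations.

Core idea: VMCNet pairs a trainable convolutional branch with a frozen ViT branch and uses a VMC module to modulate the multi-scale features, with the aim of improving detection performance.

Framework: VMCNet consists of three main modules: a trainable CNN branch that is optimized on labeled data, a frozen ViT branch that preserves the pre-trained representations, and a VMC module that performs the feature modulation.

Key innovation: The two-branch design lets the detector benefit from both labeled data and the pre-trained model, which improves detection of novel categories.

Key design: The CNN branch uses standard convolutional layers, the ViT branch uses the frozen visual encoder, and the VMC module integrates multi-scale information through feature fusion and modulation.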
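The summary does not specify the operator used inside the VMC module, so the following is only a toy sketch of what "modulating multi-scale CNN features with a ViT representation" could look like. It assumes a single-channel ViT map, nearest-neighbor resizing to each CNN scale, and an element-wise scale of `1 + vit_value`. All of these choices, along with the names `resize_nearest`, `modulate`, `stride_8`, and `stride_16`, are illustrative assumptions rather than the authors' design.

```python
def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbor resize of a 2D grid given as a list of rows."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def modulate(cnn_feature, vit_feature):
    """Assumed modulation: scale each CNN value by (1 + resized ViT value)."""
    height, width = len(cnn_feature), len(cnn_feature[0])
    vit_resized = resize_nearest(vit_feature, height, width)
    return [
        [cnn_feature[r][c] * (1.0 + vit_resized[r][c]) for c in range(width)]
        for r in range(height)
    ]


# Stand-in for the frozen ViT branch output: one 2x2 single-channel map.
vit_feature = [[0.0, 1.0],
               [0.5, 0.0]]

# Stand-in for the trainable CNN branch output at two scales.
cnn_features = {
    "stride_8": [[1.0] * 4 for _ in range(4)],
    "stride_16": [[2.0] * 2 for _ in range(2)],
}

# The same ViT map modulates every CNN scale after resizing.
modulated = {
    name: modulate(feature, vit_feature)
    for name, feature in cnn_features.items()
}
for name, feature in modulated.items():
    print(name, feature)
```

In this toy version the ViT map is never modified, which mirrors the frozen branch, while the CNN features are the only values that would change if training updated them.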

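📊 Experimental Highlights

On OV-COCO, VMCNet reaches 44.3 AP$_{50}^{\text{novel}}$ with ViT-B/16 and 48.5 with ViT-L/14. On OV-LVIS, it reaches 27.8 mAP$_{r}$ with ViT-B/16 and 38.4 with ViT-L/14. The paper reports that these results outperform prior state-of-the-art methods on novel-category detection.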
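The "50" in AP$_{50}$ refers to an intersection-over-union (IoU) threshold of 0.5: a predicted box can count as a correct detection only if its IoU with a ground-truth box of the same category is at least 0.5. The sketch below shows just that matching criterion, not the full COCO evaluation protocol. The boxes are made-up examples and `box_iou` is a hypothetical helper name.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes get an empty intersection.
    inter = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


IOU_THRESHOLD = 0.5  # the threshold implied by the "50" subscript

ground_truth = (0, 0, 10, 10)
tight_prediction = (0, 0, 10, 8)    # mostly overlaps the ground truth
loose_prediction = (5, 5, 15, 15)   # overlaps only one corner

for name, box in [("tight", tight_prediction), ("loose", loose_prediction)]:
    iou = box_iou(ground_truth, box)
    print(name, round(iou, 3), iou >= IOU_THRESHOLD)
```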

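🎯 Application Scenarios

Because it targets detection of objects from novel categories, the method is relevant to open-vocabulary detection applications such as intelligent surveillance, autonomous driving, and robot vision. It may also support future work on object detection in more complex scenes.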

📄 Abstract (Original)

Owing to large-scale image-text contrastive training, pre-trained vision language model (VLM) like CLIP shows superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to utilize the pre-trained VLMs to attain generalized representation. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, its frozen backbone doesn't benefit from the labeled data to strengthen the representation for detection. Therefore, we propose a novel two-branch backbone network, named as ViT-Feature-Modulated Multi-Scale Convolutional Network (VMCNet), which consists of a trainable convolutional branch, a frozen pre-trained ViT branch and a VMC module. The trainable CNN branch could be optimized with labeled data while the frozen pre-trained ViT branch could keep the representation ability derived from large-scale pre-training. Then, the proposed VMC module could modulate the multi-scale CNN features with the representations from ViT branch. With this proposed mixed structure, the detector is more likely to discover objects of novel categories. Evaluated on two popular benchmarks, our method boosts the detection performance on novel category and outperforms state-of-the-art methods. On OV-COCO, the proposed method achieves 44.3 AP$_{50}^{\mathrm{novel}}$ with ViT-B/16 and 48.5 AP$_{50}^{\mathrm{novel}}$ with ViT-L/14. On OV-LVIS, VMCNet with ViT-B/16 and ViT-L/14 reaches 27.8 and 38.4 mAP$_{r}$.
