CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

Authors: Zihan Liu, Yuguang Yang, Shengjie Su, Jianing Pang, Linlin Yang, Chunyu Xie, Nikolai Yu. Zolotykh, Baochang Zhang

Category: cs.CV

Published: 2026-06-05

💡 One-sentence takeaway

Proposes the CL-CLIP framework to address catastrophic forgetting in continual object detection.

🎯 Matched area: Pillar 3: Spatial Perception and Semantics (Perception & Semantics)

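Keywords: continual learning, object detection, open vocabulary, CLIP, catastrophic forgetting, multi-expert model, image-text similarity, cost volume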

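📋 Key points

Existing CLIP-based open-vocabulary detectors suffer catastrophic forgetting under continual learning and fail to retain previously learned categories.
The paper proposes CL-CLIP, which computes a CLIP image-text similarity cost volume and uses it to decouple categories, strengthening continual learning ability.
Experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning, especially in adapting to newly introduced categories.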

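📝 Abstract (summary)

Continual Object Detection (COD) requires a detector to acquire new categories while preserving previously learned ones. Existing CLIP-based open-vocabulary detectors generalize well in the zero-shot setting, but in real deployments they suffer severe catastrophic forgetting once they are continually updated on newly introduced categories. To address this, the paper proposes CL-CLIP, a framework that strengthens the continual learning ability of open-vocabulary detectors through cost-volume-guided category decoupling. Experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline, especially in adapting to newly introduced categories while preserving competitive base-class performance.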

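🔬 Method details

Problem definition: The paper targets the catastrophic forgetting that a detector suffers when new categories are introduced in continual object detection. Existing methods often fail to retain previously learned categories as they are repeatedly updated, so detection performance on those categories degrades.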

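Core idea: CL-CLIP uses cost-volume-guided category decoupling, built on CLIP image-text similarity, to strengthen the continual learning ability of open-vocabulary detectors. The aim is for the detector to keep detecting old categories well while it learns new ones.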

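Technical framework: The overall pipeline computes a CLIP image-text similarity cost volume, which yields dense category-specific response maps between visual tokens and class text embeddings, and then processes the resulting class-specific pathways with a Multi-Expert RoI head. The two main modules are the cost-volume computation module and the multi-expert processing module.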
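As a rough illustration of the cost-volume step, the sketch below computes one response map per class as the cosine similarity between each visual token and each class text embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `cost_volume` are hypothetical, plain nested lists stand in for CLIP feature tensors, and the toy embeddings are invented for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cost_volume(visual_tokens, text_embeddings):
    """Return cost[c][i][j]: cosine similarity of class c and token (i, j).

    visual_tokens: H x W grid of D-dim token embeddings (nested lists).
    text_embeddings: C class text embeddings, each D-dim.
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    grid = [[l2_normalize(tok) for tok in row] for row in visual_tokens]
    return [
        [[sum(a * b for a, b in zip(tok, t)) for tok in row] for row in grid]
        for t in texts
    ]


# Toy example (assumed data): 2x2 token grid, 2-dim embeddings, two classes.
tokens = [[[1.0, 0.0], [0.0, 2.0]],
          [[3.0, 3.0], [-1.0, 0.0]]]
class_text = [[1.0, 0.0], [0.0, 1.0]]
cv = cost_volume(tokens, class_text)  # shape: C x H x W
```

Each `cv[c]` is a dense spatial map for class `c`, which is the form of zero-shot spatial prior the summary describes; a real pipeline would compute the same quantity with batched tensor operations on CLIP features.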

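Key innovation: Cost-volume-guided category decoupling uses the cost volume as a zero-shot spatial prior to decompose shared region features into class-specific pathways. According to the abstract, this substantially improves the F-ViT baseline under continual fine-tuning and is competitive with existing continual object detectors, especially in adapting to new categories while keeping base-class performance competitive.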
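One way to picture the decoupling is sketched below. This is a hypothetical illustration, since the summary does not specify the actual mechanism: it assumes each class pathway is the shared region feature scaled by that class's mean cost-volume response inside the RoI, and that each class is routed to one linear "expert". The names `roi_mean`, `decouple`, `multi_expert_scores`, the `"base"`/`"new"` expert split, and all numbers are invented for the example.

```python
def roi_mean(response_map, box):
    """Average a class response map over box = (row0, col0, row1, col1), end-exclusive."""
    r0, c0, r1, c1 = box
    vals = [response_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return sum(vals) / len(vals)


def decouple(region_feature, cost_vol, box):
    """Split one shared region feature into one pathway per class.

    Assumption: a pathway is the shared feature scaled by the class's
    mean response inside the RoI.
    """
    return [
        [roi_mean(class_map, box) * x for x in region_feature]
        for class_map in cost_vol
    ]


def multi_expert_scores(pathways, experts, assignment):
    """Score each class pathway with its assigned expert (a linear scorer here)."""
    scores = []
    for cls, feat in enumerate(pathways):
        weights = experts[assignment[cls]]
        scores.append(sum(w * x for w, x in zip(weights, feat)))
    return scores


# Toy data (assumed): two classes, a 2x2 cost volume, a 2-dim region feature.
cost_vol = [[[1.0, 0.0], [1.0, 0.0]],
            [[0.0, 0.5], [0.0, 0.5]]]
pathways = decouple([2.0, 4.0], cost_vol, box=(0, 0, 2, 2))
experts = {"base": [1.0, 1.0], "new": [0.0, 2.0]}  # hypothetical experts
scores = multi_expert_scores(pathways, experts, assignment=["base", "new"])
```

Routing each class to its own expert means that, in a design like this one, updating the `"new"` expert leaves the `"base"` expert's weights untouched, which is one plausible reason a multi-expert head could reduce forgetting.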

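Key design: CL-CLIP uses CLIP image-text similarity as the cost volume and a Multi-Expert RoI head to process the features of different categories. The loss design also takes category decoupling into account, so that old and new categories are learned in a balanced way.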

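📊 Experimental highlights

Experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning. The gain is most visible in adapting to new categories, while base-class performance remains competitive. The size of the improvement is not specified here.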

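🎯 Application scenarios

CL-CLIP has broad potential in continual object detection, particularly where the set of detected categories must be updated over time, such as intelligent surveillance, autonomous driving, and robot vision. Stronger continual learning ability should make such systems more adaptable and more stable.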

📄 Abstract (original)

Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.
