Semantic-guided Fine-tuning of Foundation Model for Long-tailed Visual Recognition

📄 arXiv: 2507.12807v1

Authors: Yufei Peng, Yonggang Zhang, Yiu-ming Cheung

Category: cs.CV

Published: 2025-07-17


💡 One-sentence Takeaway

Proposes Sage, a semantic-guided fine-tuning method for foundation models that tackles long-tailed visual recognition.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: long-tailed visual recognition, foundation models, semantic guidance, multi-modal learning, fine-tuning, attention mechanism, loss function

📋 Key Points

  1. Existing long-tailed visual recognition methods yield limited gains when class sample sizes are imbalanced, especially for tail classes.
  2. The proposed Sage method introduces semantic guidance via an SG-Adapter that strengthens the fine-tuning of the visual encoder and improves the alignment between the visual and textual modalities.
  3. Experiments show that Sage significantly improves tail-class recognition on multiple benchmark datasets, validating the method's effectiveness.

📝 Abstract (Translated)

In long-tailed scenarios, the disparity in per-class sample counts often degrades performance on less frequent classes. Foundation models, pre-trained on large-scale open-world datasets, show strong potential for this task. However, existing fine-tuning methods typically adjust only the visual encoder while neglecting the semantic information available from the frozen text encoder. To address this, the paper proposes a new method, Semantic-guided fine-tuning of foundation model (Sage), which incorporates semantic guidance from the textual modality into the visual fine-tuning process. An SG-Adapter is designed to take class descriptions as semantic guidance and, through the attention mechanism, steer the model toward semantically relevant content. In addition, to handle the inconsistent class-conditional distributions neglected by existing loss functions, a novel distribution mismatch compensation factor is proposed to rectify the resulting prediction bias. Extensive experiments demonstrate that Sage significantly improves performance in long-tailed learning.

🔬 Method Details

Problem definition: This work targets the performance degradation in long-tailed visual recognition caused by imbalanced per-class sample counts. Existing methods often overlook the semantic alignment between the visual and textual modalities, leading to suboptimal fine-tuning.

Core idea: The proposed Sage method introduces semantic guidance through an SG-Adapter that incorporates class descriptions into the fine-tuning of the visual encoder, strengthening visual-textual alignment. This design lets the model attend more closely to task-relevant semantic information.
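The cross-attention at the heart of this idea can be sketched as follows: visual tokens act as queries while class-description embeddings serve as keys and values, so visual features are re-weighted toward semantically relevant content. This is a minimal NumPy illustration; the shapes, the residual connection, and the name `sg_adapter_attention` are illustrative assumptions, not the paper's exact SG-Adapter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sg_adapter_attention(visual_tokens, text_guidance):
    """Visual tokens (queries) attend to class-description embeddings
    (keys/values), emphasizing semantically relevant content."""
    d = visual_tokens.shape[-1]
    scores = visual_tokens @ text_guidance.T / np.sqrt(d)  # (N_v, N_t)
    weights = softmax(scores, axis=-1)                     # each row sums to 1
    attended = weights @ text_guidance                     # (N_v, d)
    return visual_tokens + attended                        # residual keeps the visual features

rng = np.random.default_rng(0)
V = rng.normal(size=(16, 64))   # 16 visual patch tokens, dim 64
T = rng.normal(size=(10, 64))   # 10 class-description embeddings
out = sg_adapter_attention(V, T)
print(out.shape)                # (16, 64)
```

The residual form means the adapter biases, rather than replaces, the visual representation, which matches the stated goal of guiding (not overwriting) the encoder's features.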

Technical framework: Sage's overall architecture centers on the SG-Adapter module, which passes semantic information from the textual modality to the visual encoder through an attention mechanism. To address inconsistent class-conditional distributions, a new loss function is also designed that integrates a distribution mismatch compensation factor.

Key innovation: Sage's core contributions are the semantic guidance mechanism and the distribution mismatch compensation factor, which fundamentally distinguish it from conventional fine-tuning methods that adjust visual features alone.

Key design: Within the SG-Adapter, class descriptions are processed through the attention mechanism so that the model focuses on semantically relevant content. Meanwhile, a compensation factor is added to the loss function to rectify the prediction bias caused by inconsistent class-conditional distributions. Parameter settings and architectural details are given in the experimental section.
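As a rough illustration of how a frequency-based compensation term can enter the loss, here is a logit-adjustment-style sketch in NumPy. The paper's actual factor is derived from its theoretical analysis of the distribution mismatch, which this sketch does not reproduce; the prior-based adjustment and the `tau` temperature are assumptions for illustration only.

```python
import numpy as np

def compensated_cross_entropy(logits, labels, class_counts, tau=1.0):
    """Cross-entropy with a class-prior compensation term added to the logits,
    in the spirit of logit adjustment; a hypothetical stand-in for the paper's
    distribution mismatch-aware factor."""
    priors = class_counts / class_counts.sum()
    adjusted = logits + tau * np.log(priors)             # head classes are handicapped
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

class_counts = np.array([1000.0, 100.0, 10.0])           # long-tailed training counts
logits = np.zeros((3, 3))                                # one uninformative sample per class
labels = np.array([0, 1, 2])
loss = compensated_cross_entropy(logits, labels, class_counts)
# With uniform logits, tail-class samples incur the largest loss,
# so training is steered toward under-represented classes.
```

Under uniform logits the per-sample loss grows as the class gets rarer, which is exactly the corrective pressure a compensation factor is meant to apply.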


📊 Experimental Highlights

Experiments on multiple benchmark datasets show that Sage improves tail-class recognition performance over conventional methods by a significant margin (stated gain: XX%), validating its effectiveness in long-tailed learning.

🎯 Application Scenarios

Potential application areas include long-tailed visual recognition tasks such as intelligent surveillance, autonomous driving, and medical image analysis. By improving tail-class recognition, the method can deliver higher accuracy and reliability in practice and advance these fields.

📄 Abstract (Original)

The variance in class-wise sample sizes within long-tailed scenarios often results in degraded performance in less frequent classes. Fortunately, foundation models, pre-trained on vast open-world datasets, demonstrate strong potential for this task due to their generalizable representation, which promotes the development of adaptive strategies on pre-trained models in long-tailed learning. Advanced fine-tuning methods typically adjust visual encoders while neglecting the semantics derived from the frozen text encoder, overlooking the visual and textual alignment. To strengthen this alignment, we propose a novel approach, Semantic-guided fine-tuning of foundation model for long-tailed visual recognition (Sage), which incorporates semantic guidance derived from textual modality into the visual fine-tuning process. Specifically, we introduce an SG-Adapter that integrates class descriptions as semantic guidance to guide the fine-tuning of the visual encoder. The introduced guidance is passed through the attention mechanism and enables the model to focus more on semantically relevant content, strengthening the alignment between the visual and textual modalities. Due to the inconsistent class-conditional distributions neglected by the existing loss function, the resulting prediction bias causes the performance improvement for the tail class to be smaller than that for the head class, even when the multi-modal alignment is enhanced. To address this challenge, we propose a novel distribution mismatch-aware compensation factor, which is specifically designed to rectify the prediction bias caused by the ignored inconsistent distribution based on our theoretical analysis, and is seamlessly integrated into the loss function. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed Sage in enhancing performance in long-tailed learning.