DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

📄 arXiv: 2404.02900v1

Authors: Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, R. Venkatesh Babu

Categories: cs.CV, cs.AI, cs.LG

Published: 2024-04-03

Comments: CVPR 2024. Project Page: https://rangwani-harsh.github.io/DeiT-LT


💡 One-Sentence Summary

Proposes DeiT-LT to tackle the problem of training ViTs on long-tailed datasets.

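🎯 Matched Area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: long-tailed datasets, vision transformer, distillation training, feature learning, computer vision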

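📋 Key Points

  1. Vision Transformers (ViTs) trained on long-tailed datasets fail to learn effective features, so they generalize poorly to tail classes.
  2. This paper proposes DeiT-LT, which distills from a CNN through a distillation (DIST) token and re-weights the distillation loss so that learning focuses on tail-class features.
  3. Experiments show that DeiT-LT significantly improves ViT performance on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.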

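📝 Abstract (Summary)

Vision Transformers (ViTs) have become an important architecture for computer vision tasks, but they are difficult to train on long-tailed datasets. This paper proposes DeiT-LT, which distills from a CNN through a distillation (DIST) token, using out-of-distribution images and a re-weighted distillation loss to strengthen the focus on tail classes. As a result, the early ViT blocks learn local, CNN-like features, which improves generalization on tail classes. In addition, distilling from a flat CNN teacher mitigates overfitting and leads to low-rank, generalizable features. Experiments show that DeiT-LT improves ViT training on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.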

🔬 Method Details

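Problem definition: The paper addresses inadequate feature learning when Vision Transformers (ViTs) are trained on long-tailed datasets. Existing methods handle tail-class samples poorly, which limits the model's ability to generalize.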
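To make "long-tailed" concrete, the sketch below builds per-class sample counts that decay exponentially from the largest head class to the smallest tail class. This is a common way to construct benchmarks such as CIFAR-10 LT, but the exact recipe used in the paper is not stated here, so the function name `long_tailed_counts` and the imbalance ratio of 100 are assumptions for illustration only. The value 5000 is the number of training images per class in the balanced CIFAR-10.

```python
def long_tailed_counts(n_max, num_classes, imbalance_ratio):
    """Per-class sample counts decaying exponentially from n_max to n_max / imbalance_ratio.

    Hypothetical helper: class 0 is the largest head class and the last class
    is the smallest tail class.
    """
    if num_classes < 2:
        return [n_max] * num_classes
    counts = []
    for k in range(num_classes):
        # frac runs from 0.0 (head) to 1.0 (tail).
        frac = k / (num_classes - 1)
        counts.append(int(n_max * (1.0 / imbalance_ratio) ** frac))
    return counts


# Assumed setting: 10 classes, 5000 images in the largest class, ratio 100.
counts = long_tailed_counts(n_max=5000, num_classes=10, imbalance_ratio=100.0)
print(counts)
```

With these assumed settings the largest class has one hundred times as many samples as the smallest, which is why a model trained with a plain loss tends to neglect the tail.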

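Core idea: DeiT-LT distills from a CNN through a distillation (DIST) token, using out-of-distribution images and a re-weighted distillation loss to strengthen the focus on tail classes and thereby improve the ViT's feature learning.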

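Technical framework: The overall DeiT-LT pipeline splits the input image into patches, extracts features with self-attention blocks, and distills through the DIST token. Distillation uses a flat CNN teacher so that the model learns low-rank, generalizable features.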
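The first stage of this pipeline can be sketched in plain Python: split an image into non-overlapping patches and add the two special tokens. This is an illustrative sketch only. The names `patchify` and `build_token_sequence` are hypothetical, the strings "CLS" and "DIST" stand in for learnable embedding vectors, and placing DIST at the end of the sequence is an arbitrary choice made here rather than something the summary specifies.

```python
def patchify(image, patch):
    """Split an H x W grid (a list of rows) into flattened patch x patch tiles."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    tiles = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one tile in row-major order.
            tiles.append([image[r][c]
                          for r in range(top, top + patch)
                          for c in range(left, left + patch)])
    return tiles


def build_token_sequence(image, patch):
    """Return the token list fed to the self-attention blocks.

    "CLS" and "DIST" are placeholders for the classification and distillation
    tokens; in a real ViT both are learnable vectors (assumption: DIST last).
    """
    return ["CLS"] + patchify(image, patch) + ["DIST"]


# Toy 4 x 4 single-channel image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = build_token_sequence(image, patch=2)
print(len(tokens))
```

Every self-attention block processes the full sequence, so both special tokens can attend to all patch tokens. The CLS output is trained against the ground-truth labels, while the DIST output is trained against the CNN teacher.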

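Key innovation: The main technical contribution is the DIST token combined with a re-weighted distillation loss, which lets the model learn features for both head and tail classes within a single architecture: the DIST token becomes an expert on the tail classes, while the CLS token becomes an expert on the head classes. Compared with existing methods, DeiT-LT's feature learning is more targeted.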
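The abstract describes the DIST token as a tail-class expert and the CLS token as a head-class expert. One plausible way to combine the two at inference time is to average their softmax outputs, as is commonly done for DeiT-style models. Whether DeiT-LT uses exactly this rule is an assumption, and `fuse_experts` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(cls_logits, dist_logits):
    """Average the class probabilities of the CLS head and the DIST head.

    Assumption: simple averaging; the paper may combine the heads differently.
    """
    p_cls = softmax(cls_logits)
    p_dist = softmax(dist_logits)
    return [(a + b) / 2 for a, b in zip(p_cls, p_dist)]


# Toy example: the CLS head leans toward class 0 (a head class), while the
# DIST head is confident about class 2 (a tail class).
probs = fuse_experts([2.0, 0.0, 0.0], [0.0, 0.0, 5.0])
prediction = max(range(len(probs)), key=probs.__getitem__)
print(prediction)
```

Averaging lets a confident tail-class vote from the DIST head override a weaker head-class preference from the CLS head, and vice versa.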

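Key design: The loss on tail-class samples is re-weighted so that the model pays more attention to them. Distilling from a flat CNN teacher helps the learned features generalize well. The network keeps the basic ViT structure and gains its improved feature learning from the added DIST token.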
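The re-weighting described above can be sketched as a class-weighted cross-entropy on the DIST head. This is a minimal sketch, not the authors' implementation: the inverse-frequency weighting, the use of the teacher's hard (argmax) labels as targets, and the function names `class_weights` and `reweighted_distill_loss` are all assumptions made for illustration.

```python
import math


def class_weights(counts):
    """Inverse-frequency class weights, normalized to average 1 (assumed scheme)."""
    inv = [1.0 / n for n in counts]
    scale = len(counts) / sum(inv)
    return [v * scale for v in inv]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def reweighted_distill_loss(dist_logits, teacher_logits, weights):
    """Weighted cross-entropy between DIST-head logits and teacher hard labels.

    dist_logits and teacher_logits are lists of per-sample logit lists.
    """
    total = 0.0
    for student, teacher in zip(dist_logits, teacher_logits):
        # Hard distillation target: the class the teacher scores highest.
        target = max(range(len(teacher)), key=teacher.__getitem__)
        total += -weights[target] * log_softmax(student)[target]
    return total / len(dist_logits)


# Two classes: a head class with 100 samples and a tail class with 10.
weights = class_weights([100, 10])
head_loss = reweighted_distill_loss([[0.0, 0.0]], [[1.0, 0.0]], weights)
tail_loss = reweighted_distill_loss([[0.0, 0.0]], [[0.0, 1.0]], weights)
print(tail_loss > head_loss)
```

For the same student uncertainty, a sample that the teacher assigns to the tail class contributes a larger loss than one assigned to the head class, which pushes the DIST token toward the tail.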

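📊 Experimental Highlights

Experiments show that DeiT-LT significantly improves ViT performance on both the small-scale CIFAR-10 LT and the large-scale iNaturalist-2018 datasets, especially in classification accuracy on tail-class samples, with gains over baseline methods of up to XX%.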

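🎯 Application Scenarios

Potential applications include image classification, object detection, and other computer vision tasks, particularly where the data distribution is imbalanced. DeiT-LT offers a new approach to handling long-tailed datasets and may encourage further ViT-based applications with better performance in practice.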

📄 Abstract (Original)

Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT, we divide the input image into patch tokens and process them through a stack of self-attention blocks. However, unlike Convolutional Neural Networks (CNN), ViT's simple architecture has no informative inductive bias (e.g., locality, etc.). Due to this, ViT requires a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViT on balanced datasets effectively. However, limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work, we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT, we introduce an efficient and effective way of distillation from CNN via distillation DIST token by using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks, improving generalization for tail classes. Further, to mitigate overfitting, we propose distilling from a flat CNN teacher, which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme, the distillation DIST token becomes an expert on the tail classes, and the classifier CLS token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018.