ViTamin: Designing Scalable Vision Models in the Vision-Language Era

📄 arXiv: 2404.02132v2

Authors: Jieneng Chen, Qihang Yu, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen

Category: cs.CV

Published: 2024-04-02 (updated: 2024-04-03)

Comments: CVPR 2024; https://github.com/Beckschen/ViTamin


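💡 Key Point in One Sentence

Proposes ViTamin, a vision model designed to improve the performance and scalability of vision-language models

🎯 Matched area: Pillar 3: Spatial Perception and Semantics (Perception & Semantics)

Keywords: vision-language models, Vision Transformer, contrastive learning, multimodal models, zero-shot learning, model scalability, feature embeddings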

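📋 Key Takeaways

  1. Existing vision-language models still rely on vanilla Vision Transformers for image encoding; alternative network architectures have rarely been studied in this setting.
  2. The paper proposes ViTamin, a vision model designed specifically for vision-language models, aiming to improve zero-shot performance and scalability.
  3. ViTamin-L outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy and reports promising results on 60 diverse benchmarks.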

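📝 Abstract (Summary)

Recent breakthroughs in vision-language models (VLMs) have opened a new chapter for the vision community. Trained on large-scale Internet image-text pairs, VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models. Despite these achievements, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. This paper builds an evaluation protocol for vision models under the contrastive language-image pretraining (CLIP) framework and introduces ViTamin, a new vision model tailored for VLMs. ViTamin-L reports promising results across 60 diverse benchmarks and outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy.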

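🔬 Method Details

Problem definition: In vision-language models, vanilla Vision Transformers remain the default image encoder, while the many network designs proposed on the ImageNet benchmark are rarely studied in the VLM setting. Because of the small data and model scale, design conclusions drawn on ImageNet can be limited and biased.

Core idea: Introduce ViTamin, a vision model tailored for VLMs, and evaluate it for zero-shot performance and for scalability in both model size and training data size.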
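
To make the zero-shot setting concrete, the sketch below shows how a CLIP-style model can classify an image without task-specific training: the image embedding is compared against the text embeddings of the class prompts, and the most similar class wins. The embeddings are hand-written toy vectors and the function names are hypothetical; this illustrates the general technique and is not ViTamin's code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return (best_class_index, cosine_scores) for one image embedding."""
    img = l2_normalize(image_embedding)
    scores = []
    for text_vec in class_text_embeddings:
        txt = l2_normalize(text_vec)
        scores.append(sum(a * b for a, b in zip(img, txt)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data (assumption): in practice these vectors would come from the
# image encoder and the text encoder of a trained CLIP-style model.
class_names = ["cat", "dog", "car"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embedding = [0.2, 0.9, 0.1]

best, scores = zero_shot_predict(image_embedding, text_embeddings)
print(class_names[best])  # prints "dog"
```

Zero-shot accuracy on a benchmark such as ImageNet is then the fraction of images for which the predicted class matches the label.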

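Technical framework: The evaluation protocol is built on the contrastive language-image pretraining (CLIP) framework. It benchmarks different vision models on zero-shot performance and on scalability in both model size and training data size.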
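
The contrastive objective behind CLIP-style pretraining can be sketched as a symmetric cross-entropy over temperature-scaled cosine similarities between the image and text embeddings of a batch. The version below is a minimal pure-Python illustration; the temperature of 0.07, the toy embeddings, and the function names are assumptions for the example, and it is not OpenCLIP's or the paper's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: the correct text for image k is column k.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch (assumption): three orthogonal embeddings.
embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = clip_contrastive_loss(embs, embs)
# Rotate the texts so that no image is paired with its matching text.
mismatched = clip_contrastive_loss(embs, embs[1:] + embs[:1])
print(aligned < mismatched)  # prints "True"
```

The loss is low when each image is most similar to its own caption and high otherwise, which is what pushes the image encoder toward embeddings that align with text.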

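Key innovation: ViTamin is a vision model designed specifically for VLMs. It improves results across multiple benchmarks and reaches higher ImageNet zero-shot accuracy than EVA-E with roughly one tenth of the parameters.

Key design: ViTamin-L and ViT-L are trained on the same publicly available DataComp-1B dataset with the same OpenCLIP training scheme. ViTamin-XL, with only 436M parameters, attains 82.9% ImageNet zero-shot accuracy, surpassing EVA-E, which has about ten times as many parameters (4.4B).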

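📊 Experimental Highlights

ViTamin-L outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy and shows promising results on 60 diverse benchmarks. ViTamin-XL, with only 436M parameters, attains 82.9% accuracy, above the 82.0% of EVA-E, which has 4.4B parameters.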
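
As a quick check of the "ten times more parameters" comparison, the arithmetic can be worked out from the figures reported in the abstract. The variable names are illustrative; the numbers are the ones quoted in this summary.

```python
# Figures quoted from the abstract.
vitamin_xl_params = 436e6   # ViTamin-XL: 436M parameters
eva_e_params = 4.4e9        # EVA-E: 4.4B parameters
vitamin_xl_acc = 82.9       # ImageNet zero-shot accuracy (%)
eva_e_acc = 82.0            # ImageNet zero-shot accuracy (%)

param_ratio = eva_e_params / vitamin_xl_params
acc_gap = vitamin_xl_acc - eva_e_acc

print(f"EVA-E is {param_ratio:.1f}x larger")            # prints "EVA-E is 10.1x larger"
print(f"ViTamin-XL leads by {acc_gap:.1f} points")      # prints "ViTamin-XL leads by 0.9 points"
```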

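🎯 Application Scenarios

Potential applications include classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. The model design and evaluation protocol offer a reference point for developing future vision-language models.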

📄 Abstract (Original)

Recent breakthroughs in vision-language models (VLMs) start a new page in the vision community. The VLMs provide stronger and more generalizable feature embeddings compared to those from ImageNet-pretrained models, thanks to the training on the large-scale Internet image-text pairs. However, despite the amazing achievement from the VLMs, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although pure transformer proves its effectiveness in the text encoding area, it remains questionable whether it is also the case for image encoding, especially considering that various types of networks are proposed on the ImageNet benchmark, which, unfortunately, are rarely studied in VLMs. Due to small data/model scale, the original conclusions of model design on ImageNet can be limited and biased. In this paper, we aim at building an evaluation protocol of vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and scalability in both model and training data sizes. To this end, we introduce ViTamin, a new vision models tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy, when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy, surpassing 82.0% achieved by EVA-E that has ten times more parameters (4.4B).