CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Authors: Yiping Wang, Yifang Chen, Wendan Yan, Alex Fang, Wenjing Zhou, Kevin Jamieson, Simon Shaolei Du

Categories: cs.LG, cs.CV

Published: 2024-05-29 (updated: 2024-12-20)

Comments: This paper supersedes our previous VAS paper (arXiv:2402.02055). It was accepted to NeurIPS 2024 as a spotlight paper. DataComp benchmark: https://www.datacomp.ai/dcclip/leaderboard.html

💡 One-Sentence Summary

Proposes s-CLIPLoss and NormSim to improve data selection for multimodal contrastive learning.

🎯 Matched Areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: multimodal contrastive learning, data selection, CLIP, data quality assessment, pretraining, vision-language models, NormSim, s-CLIPLoss

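📋 Key Points

Large-scale vision-language pretraining suffers from noisy data, and existing selection methods depend on external models or on specific CLIP models, which limits their generality.
s-CLIPLoss strengthens quality assessment by accounting for a sample's alignment with its contrastive pairs; NormSim uses a norm to measure the similarity between pretraining data and target data.
Experiments show gains of 5.3% on ImageNet-1k and 2.8% across 38 downstream tasks over the best OpenAI CLIP-L/14 baseline, and the methods can be combined with the current best methods to improve performance further.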

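📝 Abstract (Translated)

Data selection has become a core problem in large-scale vision-language pretraining (e.g., CLIP), especially when training on noisy web datasets. Existing data selection methods fall into three groups: (1) using external non-CLIP models to assist selection; (2) training new CLIP-style embedding models that select high-quality data more effectively than the original OpenAI CLIP model; and (3) designing more general metrics or strategies that work with any CLIP embedding without requiring specific model properties (CLIPScore is one popular metric). This paper focuses on the third group and proposes two new methods. First, it introduces surrogate-CLIPLoss (s-CLIPLoss), which is inspired by the CLIP loss and adds the alignment between a sample and its contrastive pairs as an extra normalization term to measure quality better. Second, when the downstream task is known, it proposes NormSim, a new norm-based metric that measures the similarity between pretraining data and target data. The methods are evaluated on the DataComp benchmark. Compared with the best baseline that uses only OpenAI's CLIP-L/14, they achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement across 38 downstream evaluation tasks. Both s-CLIPLoss and NormSim are also compatible with existing techniques: combined with the current best methods, DFN and HYPE, they raise average downstream performance by 0.9%, setting a new state of the art on the DataComp-medium benchmark.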

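🔬 Method Details

Problem definition: The paper addresses how to select high-quality data from large, noisy datasets for pretraining in multimodal contrastive learning. Existing methods either rely on external models, which adds computational cost and dependencies; or require training a new CLIP-style model, which is expensive; or propose general metrics whose gains are limited. The core problem is therefore to design a data selection method that is both more effective and more general.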

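Core idea: Design new data quality metrics that can be applied directly to an existing CLIP model, with no extra model training and no external dependencies. s-CLIPLoss improves the quality judgment by bringing in alignment information between a sample and its contrastive samples. NormSim measures the similarity between pretraining data and downstream-task data in order to select the data best suited to the downstream task.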
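For reference, the baseline these metrics build on can be sketched in a few lines. The snippet below is a minimal illustration of a CLIPScore-style score, taken here to be the cosine similarity between an image embedding and its paired text embedding. The function names and the toy vectors are hypothetical; real embeddings would come from a CLIP model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def clip_score(image_emb, text_emb):
    """Cosine similarity between one image embedding and its paired text embedding."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return sum(a * b for a, b in zip(img, txt))

# Toy 3-d vectors standing in for real CLIP embeddings (hypothetical values).
aligned = clip_score([1.0, 0.0, 0.0], [2.0, 0.0, 0.0])
orthogonal = clip_score([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])
```

A score of this kind looks only at the two modalities of a single sample, which is the limitation the abstract attributes to classical CLIP scores.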

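Technical framework: The framework has two main modules, one computing s-CLIPLoss and one computing NormSim. The s-CLIPLoss module builds on the CLIP loss function and adds a normalization term based on alignment with contrastive samples. The NormSim module computes a norm-based similarity between pretraining data and downstream-task data in feature space. The two modules can be used independently or together. The selection pipeline is: first extract image and text embeddings with a CLIP model, then compute a quality score for each sample with s-CLIPLoss or NormSim, and finally select a high-quality subset according to the scores.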
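The last step of the pipeline described above, selecting a subset by score, can be sketched as follows. This is a minimal illustration that assumes per-sample scores have already been computed and that selection simply keeps a fixed top fraction; the helper name `select_top_fraction` and the example scores are hypothetical, and the paper may use thresholds or multi-stage filtering instead.

```python
def select_top_fraction(scores, fraction):
    """Return the indices of the highest-scoring samples, keeping `fraction` of them."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one sample so that a tiny fraction never yields an empty subset.
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return indices in their original order for easier downstream lookup.
    return sorted(ranked[:keep])

# Hypothetical quality scores for five samples.
scores = [0.31, 0.12, 0.27, 0.05, 0.40]
kept = select_top_fraction(scores, 0.4)
```

Because the selector depends only on a list of scores, the same function could be reused with either metric, or with a combination of the two.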

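Key innovation: The main technical contribution is two new data quality metrics, s-CLIPLoss and NormSim. s-CLIPLoss differs from the conventional CLIP score in that it takes the relationship between a sample and its contrastive samples into account, which lets it assess data quality more accurately. NormSim uses a norm-based similarity to measure how relevant pretraining data is to the downstream-task data, so that the data best suited to the downstream task can be selected.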

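Key design: The key design choice in s-CLIPLoss is the normalization term built from contrastive pairs. For a sample (image, text), s-CLIPLoss considers not only how well the image and the text align with each other, but also how well the image aligns with the other texts in the batch and how well the text aligns with the other images in the batch. The key design choice in NormSim is to use a norm to measure similarity: embeddings for both the pretraining data and the downstream-task data are first extracted with a CLIP model, and each pretraining sample is then scored by a norm of its similarities to the target embeddings.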
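The two scores can be sketched roughly as below. This is an illustrative reading of the description above, not the authors' implementation. It assumes unit-norm embeddings, a single batch, a temperature of 0.07, a score equal to the negated per-sample CLIP loss scaled by the temperature, and a p-norm over absolute inner products for NormSim. All function names and toy values are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def surrogate_clip_loss_score(i, image_embs, text_embs, temperature=0.07):
    """CLIP-loss-inspired score for sample i within one batch (higher is better).

    Equals the pair's own similarity minus a normalization term built from
    sample i's similarities to every text and every image in the batch.
    """
    own = dot(image_embs[i], text_embs[i])
    img_to_texts = [dot(image_embs[i], t) / temperature for t in text_embs]
    text_to_imgs = [dot(v, text_embs[i]) / temperature for v in image_embs]
    normalizer = 0.5 * temperature * (
        log_sum_exp(img_to_texts) + log_sum_exp(text_to_imgs)
    )
    return own - normalizer

def norm_sim(sample_emb, target_embs, p=2.0):
    """p-norm of the similarities between one sample and all target embeddings."""
    sims = [abs(dot(sample_emb, t)) for t in target_embs]
    if math.isinf(p):
        return max(sims)
    return sum(s ** p for s in sims) ** (1.0 / p)

# Toy unit-norm 2-d embeddings standing in for CLIP outputs (hypothetical values).
distinct_imgs, distinct_txts = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
generic_imgs, generic_txts = [[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]

# Both pairs have the same plain similarity (1.0), but in the second batch the
# caption matches every image equally well, so its normalizer is larger.
distinct = surrogate_clip_loss_score(0, distinct_imgs, distinct_txts)
generic = surrogate_clip_loss_score(0, generic_imgs, generic_txts)

targets = [[1.0, 0.0], [0.0, 1.0]]
sim_l2 = norm_sim([0.6, 0.8], targets, p=2.0)
sim_max = norm_sim([0.6, 0.8], targets, p=math.inf)
```

In this toy setup, a pair whose caption fits every image in the batch is penalized relative to a pair whose caption is specific, even though a plain similarity score would rate them equally. A single batch gives a noisy normalizer, so averaging the score over several randomly drawn batches would be a natural refinement.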

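📊 Experimental Highlights

Experiments show that s-CLIPLoss and NormSim deliver clear gains on the DataComp dataset. Compared with the best baseline that uses only OpenAI's CLIP-L/14, the methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement across 38 downstream evaluation tasks. Combining them with the current best methods, DFN and HYPE, raises average downstream performance by 0.9%, setting a new state of the art on the DataComp-medium benchmark.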

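🎯 Application Scenarios

The results apply broadly to multimodal pretraining, especially when data quality is uneven. Selecting high-quality data for pretraining can improve model performance and generalization. The approach could also be applied to tasks such as data cleaning and data augmentation to improve data utilization and training efficiency. In the future, the work could be extended to data selection for other modalities, such as audio and video.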

📄 Abstract (Original)

Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce surrogate-CLIPLoss (s-CLIPLoss), a CLIP loss-inspired method that adds the alignment between one sample and its contrastive pairs as an extra normalization term for better quality measurement. Secondly, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark, DataComp (Gadre et al., 2023). Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. Moreover, both s-CLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods DFN and HYPE, we can boost average performance on downstream tasks by 0.9%, achieving a new state-of-the-art on the DataComp-medium benchmark.
