$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
Authors: Vlad Sobal, Mark Ibrahim, Randall Balestriero, Vivien Cabannes, Diane Bouchacourt, Pietro Astolfi, Kyunghyun Cho, Yann LeCun
Categories: cs.CV, cs.LG
Published: 2024-07-25 (updated: 2024-09-11)
💡 One-Sentence Takeaway
Proposes the $\mathbb{X}$-Sample contrastive loss to improve contrastive learning.
🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: contrastive learning, sample similarity, self-supervised learning, vision models, multimodal learning
📋 Key Points
- The similarity graph in existing contrastive methods is binary: it ignores similarities across samples, limiting the learned representations.
- This paper proposes the $\mathbb{X}$-Sample contrastive loss, which explicitly encodes how each sample relates to the others, improving contrastive learning.
- Experiments show the proposed objective outperforms CLIP and other baselines across multiple datasets, with especially large gains in low-data regimes.
📝 Abstract (Translated)
Learning good representations requires capturing the diverse ways in which data samples relate. The contrastive loss, an objective that matches related samples, is limited in that its similarity graph is binary: only a single positive sample is considered, and similarities across samples are ignored. Motivated by this observation, this paper proposes the $\mathbb{X}$-Sample contrastive loss, which explicitly encodes relations between samples. In experiments on ImageNet-1k, CC3M, and CC12M, the proposed objective outperforms both contrastive self-supervised and vision-language models across a range of tasks, performing particularly well in low-data regimes, a step toward foundation models that better understand sample relations.
🔬 Method Details
Problem definition: Existing contrastive methods rely on a binary similarity graph, so similarities across samples go unused, limiting the quality of the learned representations.
Core idea: The proposed $\mathbb{X}$-Sample contrastive loss explicitly encodes relations between samples, allowing the model to better capture cross-sample similarities in the embedding space.
Technical framework: The overall pipeline has three stages: constructing a sample similarity graph, computing the contrastive loss, and training the model. A similarity graph over samples is built first, a new contrastive loss is computed from this graph, and the model is trained by optimizing that loss.
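The first stage, building the sample similarity graph, can be illustrated with caption (or class-description) embeddings. The sketch below is one plausible construction under assumed choices (cosine similarity plus a row-wise softmax with temperature `tau`); the paper's exact graph construction may differ.

```python
import numpy as np

def build_similarity_graph(caption_embeddings, tau=0.1):
    """Hypothetical sketch: turn caption embeddings into a row-stochastic
    similarity graph whose rows can serve as soft contrastive targets.
    NOTE: the similarity measure and normalization here are assumptions,
    not the authors' exact recipe."""
    # cosine similarity between all pairs of caption embeddings
    e = caption_embeddings / np.linalg.norm(
        caption_embeddings, axis=1, keepdims=True
    )
    sims = e @ e.T                                  # (n, n), diagonal = 1
    # row-wise softmax with temperature -> each row sums to 1
    logits = sims / tau
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    graph = np.exp(logits)
    graph /= graph.sum(axis=1, keepdims=True)
    return graph
```

Each row of the resulting graph is a distribution over samples, with the largest mass on the sample itself, replacing the one-hot targets of standard contrastive learning.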
Key innovation: The central innovation is incorporating cross-sample similarities into the contrastive loss, breaking the traditional reliance on a single positive sample and improving the model's understanding of sample relations.
Key design: In the loss design, the paper introduces a method for constructing the sample similarity graph and dynamically adjusts cross-sample similarity weights during training. Specific parameter settings and architecture details are given in the experiments section.
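To make the loss stage concrete, the following is a minimal sketch (not the authors' released code) of a soft-target contrastive loss: standard InfoNCE computes cross-entropy against one-hot targets, whereas here the targets come from row-normalizing a cross-sample similarity graph. The function name and temperature default are illustrative assumptions.

```python
import numpy as np

def xsample_contrastive_loss(z_a, z_b, similarity_graph, temperature=0.1):
    """Soft-target contrastive loss sketch.

    z_a, z_b: (n, d) L2-normalized embeddings of two views / modalities.
    similarity_graph: (n, n) nonnegative cross-sample similarities; rows
        are normalized into soft target distributions, replacing the
        one-hot (binary) targets of standard InfoNCE.
    """
    logits = (z_a @ z_b.T) / temperature            # (n, n) pairwise scores
    # row-wise log-softmax, numerically stabilized
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # normalize the graph rows into target distributions
    targets = similarity_graph / similarity_graph.sum(axis=1, keepdims=True)
    # cross-entropy between soft targets and predicted distributions
    return float(-(targets * log_probs).sum(axis=1).mean())
```

Passing the identity matrix as `similarity_graph` recovers the standard one-positive InfoNCE objective, which makes the binary similarity graph described above explicit as a special case.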
📊 Experimental Highlights
When trained on CC12M, the proposed method outperforms CLIP by $0.6\%$ on both ImageNet and ImageNet Real. In the low-data regime, training on CC3M yields gains of $16.8\%$ and $18.1\%$ respectively, and the method also improves over CLIP by $3.3$-$5.6\%$ on ImageNet9, validating its effectiveness.
🎯 Application Scenarios
Potential application areas include computer vision, natural language processing, and multimodal learning. By improving contrastive learning, the method can boost performance on tasks such as image recognition and text generation, offering significant practical value and broad future impact.
📄 Abstract (Original)
Learning good representations involves capturing the diverse ways in which data samples relate. Contrastive loss - an objective matching related samples - underlies methods from self-supervised to multimodal learning. Contrastive losses, however, can be viewed more broadly as modifying a similarity graph to indicate how samples should relate in the embedding space. This view reveals a shortcoming in contrastive learning: the similarity graph is binary, as only one sample is the related positive sample. Crucially, similarities \textit{across} samples are ignored. Based on this observation, we revise the standard contrastive loss to explicitly encode how a sample relates to others. We experiment with this new objective, called $\mathbb{X}$-Sample Contrastive, to train vision models based on similarities in class or text caption descriptions. Our study spans three scales: ImageNet-1k with 1 million, CC3M with 3 million, and CC12M with 12 million samples. The representations learned via our objective outperform both contrastive self-supervised and vision-language models trained on the same data across a range of tasks. When training on CC12M, we outperform CLIP by $0.6\%$ on both ImageNet and ImageNet Real. Our objective appears to work particularly well in lower-data regimes, with gains over CLIP of $16.8\%$ on ImageNet and $18.1\%$ on ImageNet Real when training with CC3M. Finally, our objective seems to encourage the model to learn representations that separate objects from their attributes and backgrounds, with gains of $3.3$-$5.6$\% over CLIP on ImageNet9. We hope the proposed solution takes a small step towards developing richer learning objectives for understanding sample relations in foundation models.