On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models
Authors: Sean Farhat, Deming Chen
Categories: cs.LG, cs.AI
Published: 2024-04-04 (updated: 2024-05-03)
Notes: ICLR 2024. 5th Workshop on Practical ML for Low Resource Settings (PML4LRS). Code can be found at https://github.com/sfarhat/dapt
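💡 One-Sentence Takeaway
Proposes distilling from a pre-trained teacher as an alternative to pre-training small models.
🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)
Keywords: knowledge distillation, small models, contrastive learning, pre-training, generative models, data augmentation, computational efficiency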
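📋 Key Points
- Small models usually rely on pre-training to perform well, but pre-training is expensive and slow.
- The paper proposes distilling knowledge from a pre-trained teacher so that a small model reaches, on a specific task, performance comparable to pre-training followed by fine-tuning, at lower compute cost.
- Experiments show that small models trained this way match their pre-trained counterparts, with training up to 94% faster than the standard pre-training paradigm.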
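📝 Abstract (translated from Chinese)
This paper argues that small models can reap the benefits of pre-training without paying its cost. When distilled on a task from a pre-trained teacher model, a small model can match or exceed the performance it would reach if it were pre-trained and then fine-tuned on that task. The authors connect knowledge distillation to modern contrastive learning, which allows distillation between very different architectures and makes most contrastive learning algorithms rooted in Noise Contrastive Estimation applicable. They demonstrate the approach with pre-trained teachers from open-source model hubs and propose a new distillation algorithm chosen for its low computational cost. The phenomenon tends not to occur when the task is data-limited, but augmenting the dataset with a large pre-trained generative model alleviates this. The resulting training method is up to 94% faster than the standard pre-training paradigm without sacrificing performance.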
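🔬 Method Details
Problem definition: The paper targets the high cost of pre-training that small models normally depend on for task performance. Existing approaches spend large amounts of compute on pre-training, which limits where small models can be used.
Core idea: A small model can be distilled directly on the target task from a pre-trained teacher and still perform well, avoiding the complexity and cost of pre-training the small model itself.
Technical framework: The pipeline consists of a pre-trained teacher model, a distillation stage, and the small (student) model being trained. The distillation stage transfers the teacher's knowledge to the student, with the distillation objective formulated within a contrastive learning framework.
Key innovation: The main contribution is reducing knowledge distillation to contrastive learning. This lets teacher and student use very different architectures (for example, Transformer and convolution based combinations), and it makes most contrastive learning algorithms rooted in Noise Contrastive Estimation applicable.
Key design: The distillation objective is adapted from the Alignment/Uniformity perspective of contrastive learning by Wang & Isola (2020), chosen for its low computational cost. For data-limited tasks, a large pre-trained generative model is used to augment the dataset and improve the small model's performance.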
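To make the Alignment/Uniformity idea concrete, the following is a minimal sketch of how such a distillation loss could be computed. It is not the authors' implementation (see the linked repository for that). It assumes that student and teacher embeddings already share a dimension (for instance after a projection head), that alignment is measured between the student and teacher embeddings of the same input, and that uniformity is measured on the student embeddings alone. The function names and the weight `lam` are hypothetical; `alpha=2` and `t=2` follow the defaults commonly used with Wang & Isola's losses. Plain Python lists are used for readability; real training would use an autodiff framework.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment_loss(student, teacher, alpha=2):
    """Mean ||s_i - t_i||^alpha over matched (same-input) pairs."""
    total = sum(sq_dist(s, t) ** (alpha / 2) for s, t in zip(student, teacher))
    return total / len(student)

def uniformity_loss(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs."""
    n = len(embeddings)
    potentials = [
        math.exp(-t * sq_dist(embeddings[i], embeddings[j]))
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return math.log(sum(potentials) / len(potentials))

def distill_loss(student_raw, teacher_raw, lam=1.0):
    """Assumed combination: align student to teacher, spread the student out."""
    student = [l2_normalize(v) for v in student_raw]
    teacher = [l2_normalize(v) for v in teacher_raw]
    return alignment_loss(student, teacher) + lam * uniformity_loss(student)

# Toy batch: three inputs, 2-d embeddings.
teacher_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
matching_student = [[2.0, 0.0], [0.0, 3.0], [-5.0, 0.0]]   # same directions
collapsed_student = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # all identical

print(distill_loss(matching_student, teacher_emb))
print(distill_loss(collapsed_student, teacher_emb))
```

In this toy batch the student that matches the teacher's directions gets a lower loss than the collapsed student: its alignment term is zero and its embeddings are spread over the sphere, whereas the collapsed student is penalized by the alignment term and gains nothing from uniformity.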
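📊 Experimental Highlights
Small models distilled on a task from a pre-trained teacher match, and sometimes surpass, the performance of the same models pre-trained and then fine-tuned, while training up to 94% faster than the standard pre-training paradigm. The effect tends not to appear on data-limited tasks, although augmenting the dataset with a large pre-trained generative model alleviates this.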
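🎯 Application Scenarios
Potential applications include tasks in areas such as natural language processing and computer vision that call for small models, particularly in resource-constrained environments. By reducing the need for pre-training, the work makes small models practical in more settings and may encourage further development of lightweight models.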
📄 Abstract (Original)
In this paper, we propose that small models may not need to absorb the cost of pre-training to reap its benefits. Instead, they can capitalize on the astonishing results achieved by modern, enormous models to a surprising degree. We observe that, when distilled on a task from a pre-trained teacher model, a small model can achieve or surpass the performance it would achieve if it was pre-trained then finetuned on that task. To allow this phenomenon to be easily leveraged, we establish a connection reducing knowledge distillation to modern contrastive learning, opening two doors: (1) vastly different model architecture pairings can work for the distillation, and (2) most contrastive learning algorithms rooted in the theory of Noise Contrastive Estimation can be easily applied and used. We demonstrate this paradigm using pre-trained teacher models from open-source model hubs, Transformer and convolution based model combinations, and a novel distillation algorithm that massages the Alignment/Uniformity perspective of contrastive learning by Wang & Isola (2020) into a distillation objective. We choose this flavor of contrastive learning due to its low computational cost, an overarching theme of this work. We also observe that this phenomenon tends not to occur if the task is data-limited. However, this can be alleviated by leveraging yet another scale-inspired development: large, pre-trained generative models for dataset augmentation. Again, we use an open-source model, and our rudimentary prompts are sufficient to boost the small model's performance. Thus, we highlight a training method for small models that is up to 94% faster than the standard pre-training paradigm without sacrificing performance. For practitioners discouraged from fully utilizing modern foundation datasets for their small models due to the prohibitive scale, we believe our work keeps that door open.