Balance Divergence for Knowledge Distillation

Authors: Yafei Qi, Chen Wang, Zhaoning Zhang, Yaping Liu, Yongmin Zhang

Category: cs.CV

Published: 2025-01-14

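💡 One-Sentence Summary

Proposes Balance Divergence Distillation, which addresses the underuse of negative knowledge in knowledge distillation.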

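🎯 Matched Area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: knowledge distillation, model compression, dark knowledge, KL divergence, reverse KL divergence, computer vision, model optimization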

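📋 Key Points

Existing knowledge distillation methods rely on KL divergence, which can overlook the negative knowledge carried by the extremely small probabilities in the teacher network's output, so the student network learns incompletely.
Balance Divergence Distillation uses reverse KL divergence to compensate for this negative knowledge, balancing the learning of positive and negative knowledge and improving student performance.
On image classification, the method raises the accuracy of lightweight student networks by 1% to 3% on CIFAR-100 and ImageNet; on semantic segmentation, it raises the mIoU of PSP-ResNet18 on Cityscapes by 4.55%.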

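📝 Abstract (Translated from Chinese)

Knowledge distillation is widely used in computer vision because it can effectively improve lightweight student networks by transferring knowledge from cumbersome teacher networks. Most existing methods use Kullback-Leibler divergence to make the student's logit output probabilities mimic the teacher's. These methods may neglect the negative part of the teacher's "dark knowledge", because the divergence computation can ignore the effect of extremely small probabilities in the teacher's logit output. This deficiency may lead to suboptimal logit mimicry during distillation and to an imbalance in the information the student acquires. The paper studies the impact of this imbalance and proposes a new method, Balance Divergence Distillation. By adding a compensatory operation based on reverse Kullback-Leibler divergence, the method improves the modeling of the extremely small values in the teacher's negative information while preserving the ability to learn the positive information. The authors also test different temperature coefficient settings, which can further balance knowledge transfer. The method is evaluated on several computer vision tasks, including image classification and semantic segmentation. It improves the accuracy of lightweight students by 1% to 3% on CIFAR-100 and ImageNet, and improves the mIoU of PSP-ResNet18 by 4.55% on Cityscapes. The experiments indicate that the method is simple, effective, and easy to apply on top of different knowledge distillation methods.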

🔬 Method Details

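Problem definition: Existing knowledge distillation methods mainly rely on Kullback-Leibler (KL) divergence to measure the difference between the outputs of the teacher and student networks. However, KL divergence is insensitive to classes that the teacher assigns very low probability, so the student struggles to learn the teacher's "dark knowledge", especially the negative information carried by those extremely small probabilities. This imbalance limits the student's performance and keeps it from fully mimicking the teacher's behavior.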
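
For reference, the forward KL divergence discussed here has the standard form below. The symbols are notation assumed for this note rather than taken from the paper: p is the teacher's softmax output, q is the student's, and C is the number of classes.

$$
\mathrm{KL}(p \,\|\, q) = \sum_{i=1}^{C} p_i \log \frac{p_i}{q_i}
$$

Each summand is weighted by the teacher probability p_i. For a class where p_i is close to zero and q_i is fixed and positive, the summand p_i \log(p_i / q_i) also tends to zero, which is one way to read the claim that classes the teacher considers very unlikely contribute little to the loss.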

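Core idea: The paper compensates for this weakness of standard KL divergence by introducing reverse KL divergence. Reverse KL divergence is more sensitive to low-probability classes and can capture the negative information in the teacher's output that forward KL divergence overlooks. Combining forward and reverse KL divergence balances the student's learning of positive and negative knowledge and thereby improves its performance.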
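
A small numeric illustration may help here. The sketch below compares the per-class terms of forward KL(teacher || student) and reverse KL(student || teacher) on made-up distributions; the probabilities and the helper name `kl_terms` are assumptions for illustration, not values or code from the paper.

```python
import math

def kl_terms(a, b):
    """Per-class terms a_i * log(a_i / b_i) whose sum is KL(a || b)."""
    return [ai * math.log(ai / bi) for ai, bi in zip(a, b)]

# Hypothetical distributions: the student overestimates the last,
# very unlikely class by a factor of ten (0.01 instead of 0.001).
teacher = [0.9, 0.099, 0.001]
student = [0.9, 0.09, 0.01]

forward_terms = kl_terms(teacher, student)  # weighted by teacher probabilities
reverse_terms = kl_terms(student, teacher)  # weighted by student probabilities

for i, (f, r) in enumerate(zip(forward_terms, reverse_terms)):
    print(f"class {i}: forward {f:+.5f}  reverse {r:+.5f}")
print(f"forward KL = {sum(forward_terms):.5f}")
print(f"reverse KL = {sum(reverse_terms):.5f}")
```

In this toy case the mismatch on the last class has a magnitude about ten times larger in the reverse term than in the forward term, because the reverse term is weighted by the student's probability (0.01) rather than the teacher's (0.001).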

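Technical framework: Balance Divergence Distillation can be embedded in existing knowledge distillation frameworks. The main flow is as follows. First, the teacher and student networks each run a forward pass on the input. Then the forward KL divergence between the teacher and student outputs is computed, which measures how well the student mimics the teacher's main knowledge. Next, the reverse KL divergence is computed to compensate for the negative knowledge the student would otherwise neglect. Finally, the forward and reverse KL divergences are combined in a weighted sum to form the loss used to train the student.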
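
The flow above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `balanced_kd_loss`, the weights `alpha` and `beta`, the default temperature, and the simple weighted sum are all hypothetical, since the summary does not give the exact formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(a, b, eps=1e-12):
    """KL(a || b) for two discrete distributions; eps guards against log(0)."""
    return sum(ai * math.log((ai + eps) / (bi + eps)) for ai, bi in zip(a, b))

def balanced_kd_loss(teacher_logits, student_logits,
                     temperature=4.0, alpha=1.0, beta=1.0):
    """Weighted sum of forward and reverse KL (assumed form, see lead-in)."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    forward = kl(p, q)  # student mimics the teacher's main knowledge
    reverse = kl(q, p)  # compensation term for low-probability classes
    return alpha * forward + beta * reverse

# Hypothetical logits for a three-class problem.
loss = balanced_kd_loss([5.0, 1.0, -2.0], [1.0, 1.0, 1.0])
print(f"balanced loss = {loss:.5f}")
```

Setting `beta=0.0` in this sketch recovers a loss based on forward KL divergence alone, which makes it easy to compare the two settings on the same logits.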

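Key innovation: The most important contribution is the use of reverse KL divergence to balance the information transferred during distillation. Compared with conventional methods that use only forward KL divergence, Balance Divergence Distillation makes fuller use of the teacher's knowledge, especially the negative information that would otherwise be neglected. This balancing mechanism can improve the student's generalization and robustness.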

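Key design: The key design choices are: 1) the weight on the reverse KL divergence, which controls how strongly negative knowledge is compensated; 2) the temperature coefficient, which smooths the output probability distributions of the teacher and student and thus affects knowledge transfer; 3) the weighting scheme of the loss function, which balances the contributions of the forward and reverse KL divergences.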
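
To make the role of the temperature coefficient concrete, the sketch below applies a temperature-scaled softmax to made-up teacher logits. The logits and the temperature values are assumptions chosen for illustration; the summary does not state which temperatures the paper uses.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = smoother)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits: one dominant class and two unlikely ones.
teacher_logits = [8.0, 2.0, -1.0]

# Record the smallest class probability at each temperature.
smallest = {}
for t in (1.0, 2.0, 4.0, 8.0):
    probs = softmax(teacher_logits, t)
    smallest[t] = min(probs)
    print(f"T={t}: " + ", ".join(f"{p:.4f}" for p in probs))
```

As the temperature rises, the distribution flattens and the least likely class receives a larger share of probability, so the divergence terms attached to that class carry more weight.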

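📊 Experimental Highlights

On CIFAR-100 and ImageNet, Balance Divergence Distillation improves the accuracy of lightweight student networks by 1% to 3%. On Cityscapes, with PSP-ResNet18 as the student network, it improves mIoU by 4.55%. These results indicate that Balance Divergence Distillation is a simple and effective knowledge distillation method that improves student performance on both classification and segmentation.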

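🎯 Application Scenarios

The approach applies to computer vision tasks that need lightweight, fast models, such as image recognition on mobile devices, object detection in autonomous driving, and behavior analysis in video surveillance. Knowledge distillation compresses a large, complex model into a small, efficient one, enabling high-performance visual processing on resource-constrained devices. The method may also improve model robustness and generalization, making models more reliable in practice.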

📄 Abstract (Original)

Knowledge distillation has been widely adopted in computer vision tasks, since it can effectively enhance the performance of lightweight student networks by leveraging the knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods utilize Kullback-Leibler divergence to mimic the logit output probabilities between the teacher network and the student network. Nonetheless, these methods may neglect the negative parts of the teacher's "dark knowledge" because the divergence calculations may ignore the effect of the minute probabilities from the teacher's logit output. This deficiency may lead to suboptimal performance in logit mimicry during the distillation process and result in an imbalance of information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method, named Balance Divergence Distillation. By introducing a compensatory operation using reverse Kullback-Leibler divergence, our method can improve the modeling of the extremely small values in the teacher's negative knowledge while preserving the learning capacity for the positive knowledge. Furthermore, we test the impact of different temperature coefficient adjustments, which can be used to further balance knowledge transfer. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1%~3% for lightweight students on both the CIFAR-100 and ImageNet datasets, and a 4.55% improvement in mIoU for PSP-ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.
