Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation

Authors: Muhammad Haseeb Aslam, Clara Martinez, Marco Pedersoli, Alessandro Koerich, Ali Etemad, Eric Granger

Category: cs.LG

Published: 2025-04-19 (updated: 2025-06-23)

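💡 One-Sentence Summary

Proposes a stochastic self-distillation method based on student-guided knowledge distillation to improve model performance in resource-constrained settings.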

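🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: knowledge distillation, self-distillation, stochastic dropout, student-guided, resource-constrained, model compression, deep learning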

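📋 Key Points

Existing self-distillation and ensemble learning methods incur high model training and deployment costs, especially in resource-constrained settings.
The paper proposes a stochastic self-distillation (SSD) method that generates diverse teacher representations through distillation-time dropout and uses the student model to guide knowledge distillation.
Experiments show that the method outperforms existing methods on multiple datasets, with low computational complexity and no increase in model size.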

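📝 Abstract (translated from Chinese)

Advances in self-distillation show that when knowledge is transferred from a teacher to a student using the same deep learning (DL) architecture, the student can outperform the teacher, especially when the network is overparameterized and the teacher is trained with early stopping. Ensemble learning can also improve performance, but training, storing, and deploying multiple models becomes impractical as the number of models grows. Even distilling an ensemble into a single student model, or using weight averaging methods, first requires training multiple teacher models and does not fully exploit inherent stochasticity to generate and distill diversity in DL models. These limitations are especially prohibitive in resource-constrained or latency-sensitive applications such as wearable devices. This paper proposes training only one model and using distillation-time dropout to generate multiple diverse teacher representations. However, generating these representations stochastically produces noisy representations that are misaligned with the learned task. To overcome this, a novel stochastic self-distillation (SSD) training strategy is introduced that filters and weights teacher representations so that distillation uses only task-relevant representations, by means of student-guided knowledge distillation (SGKD). The student representation at each distillation step serves as the authority that guides the distillation process. Experimental results on real-world affective computing datasets, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing model size at training or test time, and that its computational complexity is negligible compared with state-of-the-art ensemble learning and weight averaging methods.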

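🔬 Method Details

Problem definition: The paper targets the high cost of training and deploying deep learning models in resource-constrained settings. Existing self-distillation and ensemble learning methods either require training multiple models or fail to make effective use of the stochasticity in model training. The resulting compute and storage overhead makes them unsuitable for applications such as wearable devices.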

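Core idea: The paper exploits the stochasticity of a single model. Distillation-time dropout generates multiple distinct teacher representations, and the student model serves as a guide for filtering and weighting them, which yields efficient knowledge distillation. This avoids the cost of training multiple models while making full use of the model's inherent stochasticity.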
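As a rough illustration of how distillation-time dropout can turn one teacher into several stochastic teacher representations, the sketch below applies inverted dropout directly to a single representation vector. This is a simplification made for illustration: the summary does not say at which layers dropout is applied, and the function names, the `rate` value, and the seeding scheme are all hypothetical rather than taken from the paper.

```python
import random


def dropout(vec, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def stochastic_teacher_representations(teacher_repr, num_samples, rate, seed=0):
    """Draw `num_samples` dropout-perturbed copies of one teacher representation.

    Assumption: dropout is applied to the representation itself, not inside
    the network, purely to keep the example small.
    """
    rng = random.Random(seed)
    return [dropout(teacher_repr, rate, rng) for _ in range(num_samples)]


# Hypothetical 4-dimensional teacher representation.
teacher_repr = [0.5, -1.2, 0.3, 2.0]
samples = stochastic_teacher_representations(teacher_repr, num_samples=4, rate=0.25)
for s in samples:
    print(s)
```

Each sample uses a different dropout mask, so the samples differ from one another. Some of them may be poorly aligned with the task, which is the noise problem the student-guided step is meant to address.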

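Technical framework: The overall framework consists of the following main steps:
1. Model training: first train an initial deep learning model.
2. Stochastic teacher representation generation: during distillation, apply dropout to this model to generate multiple stochastic teacher representations.
3. Student-guided knowledge distillation: using the student model as a guide, compute the similarity between each teacher representation and the student representation, and weight the teacher representations according to that similarity.
4. Knowledge distillation: train the student model with the weighted teacher representations as the target.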
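The filtering and weighting in the student-guided step can be sketched as follows. This is a minimal example under stated assumptions, not the paper's implementation: cosine similarity, the `min_similarity` threshold, and the softmax `temperature` are illustrative choices, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def student_guided_target(student_repr, teacher_reprs, min_similarity=0.0, temperature=1.0):
    """Filter teacher representations by similarity to the student, then average them.

    Assumption: representations below `min_similarity` are discarded and the
    rest are combined with softmax weights over their similarity scores.
    Returns (target, weights), or (None, []) if every teacher is filtered out.
    """
    scored = [(cosine_similarity(student_repr, t), t) for t in teacher_reprs]
    kept = [(s, t) for s, t in scored if s >= min_similarity]
    if not kept:
        return None, []
    weights = softmax([s for s, _ in kept], temperature)
    dim = len(student_repr)
    target = [sum(w * t[i] for w, (_, t) in zip(weights, kept)) for i in range(dim)]
    return target, weights


# Hypothetical 2-dimensional example: the third teacher points away from the student.
student = [1.0, 0.0]
teachers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
target, weights = student_guided_target(student, teachers)
print(target, weights)
```

In this example the teacher with negative similarity is filtered out, and the teacher most similar to the student receives the largest weight in the distillation target.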

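Key innovation: The paper's main contribution is the student-guided knowledge distillation (SGKD) method. Conventional knowledge distillation typically uses the teacher model's output directly as the target, whereas SGKD uses information from the student model to dynamically select and weight teacher representations, improving distillation efficiency and model performance. Using distillation-time dropout to generate diverse teacher representations is a further innovation, since it avoids the cost of training multiple teacher models.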

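Key design choices:
* Dropout rate: the dropout rate affects both the diversity and the noise level of the teacher representations and needs to be tuned for the task at hand.
* Similarity measure: the similarity between student and teacher representations can be computed with various measures, such as cosine similarity or KL divergence.
* Weighting function: the weighting function assigns weights to teacher representations based on similarity; a softmax or another nonlinear function can be used.
* Loss function: the loss typically combines a distillation loss (such as KL divergence) with a task loss (such as cross-entropy).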
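The loss combination in the last bullet can be sketched as a weighted sum of a KL-divergence distillation term and a cross-entropy task term. The mixing weight `alpha`, the `temperature`, and the function names are assumptions made for illustration; the summary does not give the paper's exact formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Task loss: negative log-probability of the true class."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(student_logits, teacher_probs, label, alpha=0.5, temperature=2.0):
    """alpha * KL(teacher || student) + (1 - alpha) * cross-entropy.

    Assumption: `teacher_probs` is the (already weighted) teacher target
    expressed as a probability distribution over classes.
    """
    student_probs = softmax(student_logits, temperature)
    kd_term = kl_divergence(teacher_probs, student_probs)
    task_term = cross_entropy(student_logits, label)
    return alpha * kd_term + (1.0 - alpha) * task_term


# Hypothetical 3-class example.
logits = [2.0, 0.5, -1.0]
teacher = softmax(logits, 2.0)  # a teacher that agrees exactly with the student
loss = distillation_loss(logits, teacher, label=0)
print(loss)
```

When the teacher target matches the student's softened prediction, the distillation term vanishes and only the task loss remains. A teacher that disagrees with the student increases the total loss.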

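📊 Experimental Highlights

Experimental results show that the proposed SSD method outperforms existing self-distillation and ensemble learning methods on multiple datasets. On the HAR dataset, for example, SSD outperforms state-of-the-art methods without any increase in model size. SSD's computational complexity is also far lower than that of ensemble learning methods, which makes it better suited to resource-constrained settings.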

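🎯 Application Scenarios

The results apply broadly to resource-constrained or latency-sensitive settings such as wearable devices, mobile devices, and IoT devices. In a wearable health-monitoring device, for example, the method can be used to train an efficient biosignal classification model for low-power, high-accuracy health monitoring. It can also be applied in areas such as edge computing to improve model performance on edge devices.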

📄 Abstract (Original)

Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student using the same deep learning (DL) architecture, the student performance can surpass the teacher particularly when the network is overparameterized and the teacher is trained with early stopping. Alternatively, ensemble learning also improves performance, although training, storing, and deploying multiple models becomes impractical as the number of models grows. Even distilling an ensemble to a single student model or weight averaging methods first requires training of multiple teacher models and does not fully leverage the inherent stochasticity for generating and distilling diversity in DL models. These constraints are particularly prohibitive in resource-constrained or latency-sensitive applications such as wearable devices. This paper proposes to train only one model and generate multiple diverse teacher representations using distillation-time dropout. However, generating these representations stochastically leads to noisy representations that are misaligned with the learned task. To overcome this problem, a novel stochastic self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representation to distill from task-relevant representations only, using student-guided knowledge distillation (SGKD). The student representation at each distillation step is used as authority to guide the distillation process. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods without increasing the model size at both training and testing time, and incurs negligible computational complexity compared to state-of-the-art ensemble learning and weight averaging methods.
