Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Authors: Zihao Han, Tiangang Zhang, Huaibin Wang, Yilun Sun

Categories: cs.AI, cs.CL, cs.LO

Published: 2026-05-12

Notes: 11 pages, 4 figures; code not released yet

💡 One-sentence takeaway

Proposes ATESD, a self-distillation method with adaptive teacher exposure, to improve LLM reasoning.

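🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: LLM reasoning, self-distillation, teacher exposure, adaptive learning, policy optimization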

📋 Key points

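In existing self-distillation methods for LLM reasoning, the teacher always sees the full reference reasoning. This causes a teacher-side exposure mismatch and yields token targets that are too strong for the student to absorb.
ATESD adaptively adjusts the teacher's exposure with a Beta-policy controller that conditions on compact statistics of the student's training state.
Experiments show that ATESD consistently outperforms competitive self-distillation and RL baselines across several benchmarks and model sizes.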

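📝 Abstract (translated)

This paper proposes Adaptive Teacher Exposure for Self-Distillation (ATESD), a self-distillation method for LLM reasoning. Existing on-policy self-distillation methods let the teacher see the full reference reasoning. The authors argue that this creates a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong for the student to absorb. A controlled fixed-exposure sweep shows that full exposure is not reliably the best choice and that student-teacher mismatch grows monotonically as the teacher sees more of the reference reasoning. ATESD therefore treats teacher exposure as a learnable training-time control variable: a lightweight Beta-policy controller, conditioned on compact training-state statistics, models the reveal ratio, and one sampled ratio is held for a short window of student updates. To make the controller learnable, it is optimized with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than by the immediate change in loss. On AIME 24, AIME 25, and HMMT 25 with Qwen3-{1.7B, 4B, 8B}, ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively.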

🔬 Method details

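Problem definition: Existing on-policy self-distillation methods for LLMs let the teacher see the full reference solution, which creates a mismatch between what the teacher conditions on and what the student can currently do. When the teacher supervises from reasoning far beyond the student's competence, the resulting token targets are too strong and hinder the student's learning.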

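Core idea: Treat the teacher's exposure as a learnable variable rather than a fixed hyperparameter. Controlling the fraction of the reference reasoning that the teacher can see narrows the student-teacher gap and thereby improves self-distillation.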
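
As a rough illustration of what a reveal ratio could mean in practice, the sketch below shows the teacher only a leading fraction of the reference reasoning. The function name `reveal_prefix` and the token-level prefix cut are assumptions made for this example; the summary does not say how the ratio is mapped onto the reference text.

```python
from typing import List, Sequence


def reveal_prefix(reference_tokens: Sequence[str], reveal_ratio: float) -> List[str]:
    """Return the leading fraction of the reference reasoning shown to the teacher.

    Assumption: exposure is applied as a token-level prefix cut. The paper may
    use a different granularity (for example, reasoning steps).
    """
    if not 0.0 <= reveal_ratio <= 1.0:
        raise ValueError("reveal_ratio must lie in [0, 1]")
    cutoff = int(reveal_ratio * len(reference_tokens))  # floor to a whole token count
    return list(reference_tokens[:cutoff])


steps = ["let", "x", "=", "3", "so", "x^2", "=", "9"]
print(reveal_prefix(steps, 0.5))  # first half of the reference
print(reveal_prefix(steps, 1.0))  # full exposure, the default in prior methods
```

A ratio of 1.0 recovers the full-exposure setting that the paper questions, and a ratio of 0.0 gives the teacher no privileged reasoning at all.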

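Framework: ATESD consists of a Beta-policy controller and a student model. The controller chooses the teacher's reveal ratio from compact statistics of the student's training state (the abstract does not enumerate them; loss- or gradient-based summaries are plausible examples). The teacher, conditioned on the revealed portion of the reference reasoning, supervises the student's own rollouts, and one sampled ratio is held for a short window of student updates. Training as a whole is on-policy self-distillation, so the student's updates depend on its own rollouts.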
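
The hold-window idea can be sketched as a small loop: sample one ratio, reuse it for several student updates, then sample again. Everything here is a placeholder written for illustration, including `run_hold_windows`, the toy "loss", and the fixed Beta(2, 2) sampler; none of it is the authors' code, which has not been released.

```python
import random
from typing import Callable, List, Tuple


def run_hold_windows(
    sample_ratio: Callable[[], float],
    student_update: Callable[[float], float],
    num_windows: int,
    hold_steps: int,
) -> List[Tuple[float, List[float]]]:
    """Sample one exposure per window and reuse it for `hold_steps` updates."""
    history = []
    for _ in range(num_windows):
        ratio = sample_ratio()  # one controller decision per window
        losses = [student_update(ratio) for _ in range(hold_steps)]
        history.append((ratio, losses))
    return history


rng = random.Random(0)


def toy_update(ratio: float) -> float:
    """Placeholder student step: a made-up loss that depends on the ratio."""
    return abs(ratio - 0.6) + 0.01 * rng.random()


log = run_hold_windows(lambda: rng.betavariate(2.0, 2.0), toy_update,
                       num_windows=3, hold_steps=4)
for ratio, losses in log:
    print(f"ratio={ratio:.2f} mean_loss={sum(losses) / len(losses):.3f}")
```

Holding a decision for several updates gives each sampled ratio a measurable effect on the student before the controller is asked to decide again.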

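Key innovation: The main contribution is the adaptive teacher-exposure mechanism. Unlike prior self-distillation methods, which fix the teacher at full exposure, ATESD adjusts the exposure dynamically according to the student's learning state, so that the token targets stay within what the student can absorb.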

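Key design: The controller takes statistics of the student's training state as input and outputs the teacher's reveal ratio. It is lightweight and models the ratio with a Beta policy. To train it, the paper uses a discounted learning-progress reward that scores each held exposure decision by its effect on the student's future improvement rather than by the immediate change in loss. This reward addresses the delayed credit assignment induced by on-policy distillation.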
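
A minimal sketch of these two pieces, under stated assumptions: `beta_params` maps state statistics to Beta concentrations through a linear layer and a softplus (a parameterization chosen for this example, not taken from the paper), and `discounted_returns` turns a per-window progress signal into discounted returns. What counts as "progress" in each window is also an assumption here; the summary only says that the reward reflects future student improvement.

```python
import math
import random
from typing import List, Sequence, Tuple


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def beta_params(state: Sequence[float], w_alpha: Sequence[float],
                w_beta: Sequence[float]) -> Tuple[float, float]:
    """Map state statistics to Beta(alpha, beta) concentrations.

    Assumption: linear map plus softplus, offset by 1 so both concentrations
    are at least 1. The real controller architecture is not described here.
    """
    alpha = 1.0 + softplus(sum(w * s for w, s in zip(w_alpha, state)))
    beta = 1.0 + softplus(sum(w * s for w, s in zip(w_beta, state)))
    return alpha, beta


def discounted_returns(progress: Sequence[float], gamma: float) -> List[float]:
    """Credit each window with its own progress plus discounted later progress."""
    returns = [0.0] * len(progress)
    running = 0.0
    for t in reversed(range(len(progress))):
        running = progress[t] + gamma * running
        returns[t] = running
    return returns


state = [0.8, 0.1]  # hypothetical training-state statistics
alpha, beta = beta_params(state, [1.0, 0.0], [0.0, 1.0])
ratio = random.Random(0).betavariate(alpha, beta)  # sampled reveal ratio in [0, 1]
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
```

In the printed example the only progress arrives in the third window, yet the first two decisions still receive discounted credit for it, which is the delayed credit assignment that the reward is meant to handle.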

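📊 Experimental highlights

On AIME 24, AIME 25, and HMMT 25 with Qwen3-{1.7B, 4B, 8B}, ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively. These results indicate that adaptive teacher exposure is an effective approach to self-distillation for LLM reasoning.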

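🎯 Application scenarios

The approach could apply to settings that require complex reasoning from LLMs, such as mathematical problem solving, code generation, and knowledge-based question answering. Adaptively adjusting the teacher's exposure may improve reasoning ability and generalization and may reduce reliance on large volumes of high-quality training data, which gives the method potential practical value.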

📄 Abstract (original)

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
