Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Authors: Yecheng Wu, Song Han, Hai Cai

Categories: cs.LG, cs.AI

Published: 2026-04-14

💡 One-Sentence Summary

Lightning OPD: efficient post-training of large reasoning models via offline on-policy distillation

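🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: on-policy distillation, offline training, large language models, post-training, teacher consistency, model optimization, mathematical reasoning, code generation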

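📋 Key Points

Existing on-policy distillation (OPD) relies on live teacher inference throughout training; the resulting infrastructure overhead limits its use.
Lightning OPD precomputes teacher log-probabilities offline and enforces teacher consistency, which removes the need for a live teacher server.
Experiments on mathematical reasoning and code generation show state-of-the-art performance together with significantly better training efficiency.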

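📝 Abstract (translated)

On-policy distillation (OPD) has become an efficient post-training paradigm for large language models. Standard OPD, however, needs a live teacher inference server for the whole of training, which causes substantial infrastructure overhead. This paper asks whether on-policy distillation can be run offline. A natural approach is to precompute teacher log-probabilities over SFT rollouts and reuse them during training, but in practice this offline variant does not reliably match standard OPD. To explain the gap, the authors identify a previously overlooked condition that is critical for OPD pipelines, which they call teacher consistency: the same teacher model must be used for both supervised fine-tuning and OPD. They show that violating teacher consistency introduces an irreducible gradient bias, so that both offline and online OPD converge to a suboptimal fixed point regardless of training duration. Building on this, they propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. The design removes the need for a live teacher server entirely. They further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation show that Lightning OPD reaches state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in only 30 GPU hours, a 4.0x speedup over standard OPD, and substantially lowers the barrier to academic research on LLM post-training.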

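🔬 Method Details

Problem definition: Existing on-policy distillation (OPD) methods need a continuously running teacher inference server while post-training a large language model, which brings substantial infrastructure cost and deployment complexity. Academic groups and resource-constrained researchers struggle to afford this overhead, which hinders wider adoption and study of OPD.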

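Core idea: Run on-policy distillation offline by computing the teacher's outputs once before training and reusing them throughout training. For the offline variant to work, the paper stresses "teacher consistency": the supervised fine-tuning (SFT) stage and the OPD stage must use exactly the same teacher model.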

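Technical framework: Lightning OPD consists of the following stages: 1) Train an initial student with supervised fine-tuning (SFT), using the same teacher that will later supervise OPD. 2) Generate rollouts with the SFT model and score them once with the teacher, storing the teacher's log-probability for every token. 3) During OPD training, train the student against these precomputed teacher log-probabilities. No online access to the teacher is needed at any point in training.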
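The three stages above can be sketched as a small cache-then-train pattern. This is a minimal illustration and not the authors' implementation: `precompute_teacher_logprobs`, `toy_teacher_logprob`, the 4-token vocabulary, and the hard-coded rollouts are all hypothetical stand-ins. In a real pipeline the teacher would be a large model scored in a single batch pass.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A rollout is (prompt, token ids sampled by the SFT-initialized student).
Rollout = Tuple[str, Tuple[int, ...]]
TeacherFn = Callable[[str, Sequence[int], int], float]


def precompute_teacher_logprobs(
    rollouts: Sequence[Rollout], teacher_logprob: TeacherFn
) -> Dict[Rollout, List[float]]:
    """Score every rollout token with the teacher exactly once, before training."""
    cache: Dict[Rollout, List[float]] = {}
    for prompt, tokens in rollouts:
        cache[(prompt, tokens)] = [
            teacher_logprob(prompt, tokens[:t], tokens[t])
            for t in range(len(tokens))
        ]
    return cache


def toy_teacher_logprob(prompt: str, prefix: Sequence[int], token: int) -> float:
    """Hypothetical teacher over a 4-token vocabulary that prefers token 0."""
    probs = [0.7, 0.1, 0.1, 0.1]
    return math.log(probs[token])


# Stage 2: rollouts would be sampled from the SFT model; hard-coded here.
sft_rollouts: List[Rollout] = [("2+2=", (0, 1)), ("3*3=", (2, 0, 0))]
teacher_cache = precompute_teacher_logprobs(sft_rollouts, toy_teacher_logprob)

# Stage 3: training only reads the cache; no teacher server is queried.
for rollout in sft_rollouts:
    targets = teacher_cache[rollout]
    assert len(targets) == len(rollout[1])
```

The point of the pattern is that the teacher is called only inside `precompute_teacher_logprobs`, so the teacher's hardware can be released before student training starts.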

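Key innovation: The central contribution is identifying the teacher consistency condition and enforcing it in the offline setting. The paper shows that if the teacher used for SFT differs from the teacher used for OPD, an irreducible gradient bias appears and the model converges to a suboptimal fixed point, whether OPD is run online or offline. By precomputing teacher outputs and enforcing teacher consistency, Lightning OPD keeps offline distillation effective.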
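One simple way to make the teacher consistency condition hard to violate by accident is a configuration guard that runs before any log-probabilities are precomputed. The sketch below is an illustration only: `PipelineConfig`, its field names, and the model identifiers are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    sft_teacher: str  # hypothetical: model that supervised the SFT stage
    opd_teacher: str  # hypothetical: model whose log-probabilities supervise OPD


def check_teacher_consistency(cfg: PipelineConfig) -> None:
    """Raise if the SFT stage and the OPD stage would use different teachers."""
    if cfg.sft_teacher != cfg.opd_teacher:
        raise ValueError(
            f"teacher mismatch: SFT used {cfg.sft_teacher!r}, "
            f"OPD would use {cfg.opd_teacher!r}"
        )


# A consistent configuration passes silently.
check_teacher_consistency(PipelineConfig("teacher-A", "teacher-A"))
```

A string comparison only checks identifiers; a stricter variant could compare checkpoint hashes, which is likewise an assumption and not something the summary describes.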

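Key design: 1) Teacher log-probabilities are precomputed and stored for reuse during training. 2) The loss is the standard on-policy distillation loss, with targets taken from the precomputed teacher outputs. 3) The paper's analysis shows that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift.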
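The summary does not spell out the loss, so the sketch below assumes a common choice in on-policy distillation: a sampled-token estimate of the reverse KL, i.e. the mean of student log-probability minus teacher log-probability over the tokens of a rollout. Whether the paper uses exactly this form is an assumption, and `sampled_reverse_kl` is a hypothetical helper name.

```python
import math
from typing import Sequence


def sampled_reverse_kl(
    student_logprobs: Sequence[float], teacher_logprobs: Sequence[float]
) -> float:
    """Mean per-token gap log p_student(y_t) - log p_teacher(y_t) on one rollout.

    teacher_logprobs would be read from the precomputed cache, not a live server.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("student and teacher must score the same tokens")
    if not student_logprobs:
        raise ValueError("rollout must contain at least one token")
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(gaps) / len(gaps)


# Toy numbers: the student matches the teacher on token 1 and is less
# confident than the teacher on token 2.
student = [math.log(0.5), math.log(0.25)]
teacher = [math.log(0.5), math.log(0.5)]
gap = sampled_reverse_kl(student, teacher)
```

On a single rollout this estimate can be negative even though the true KL divergence is not, because it is evaluated only at the sampled tokens.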

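📊 Experimental Highlights

Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% accuracy on the AIME 2024 mathematical reasoning benchmark in 30 GPU hours, a 4.0x speedup over standard OPD. The result indicates that Lightning OPD improves training efficiency and lowers compute cost while keeping performance.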
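To put the speedup in concrete terms, the arithmetic below reads the 4.0x figure as a ratio of total GPU hours. That reading is an assumption: the summary does not state the baseline's cost or how the speedup was measured, so the derived numbers are illustrative and are not reported results.

```python
# Figures reported in the summary.
lightning_gpu_hours = 30.0
speedup = 4.0

# Assumption: speedup = baseline GPU hours / Lightning OPD GPU hours.
implied_baseline_gpu_hours = lightning_gpu_hours * speedup
implied_saving_gpu_hours = implied_baseline_gpu_hours - lightning_gpu_hours

print(f"implied standard OPD cost: {implied_baseline_gpu_hours:.0f} GPU hours")
print(f"implied saving: {implied_saving_gpu_hours:.0f} GPU hours")
```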

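🎯 Application Scenarios

Lightning OPD lowers the barrier to post-training large language models, so that researchers and developers with limited resources can also optimize and customize models efficiently. The method applies to settings that need fine-tuning and knowledge transfer, such as training domain-specific language models and model compression and acceleration.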

📄 Abstract (original)

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
