Trajectory as the Teacher: Few-Step Discrete Flow Matching via Energy-Navigated Distillation

Authors: Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova

Categories: cs.LG, cs.AI, cs.CL

Published: 2026-05-08

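💡 One-Sentence Takeaway

Proposes Trajectory-Shaped Discrete Flow Matching (TS-DFM), which uses energy-navigated distillation to enable efficient few-step text generation.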

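🎯 Matched area: Pillar 2, RL Algorithms and Architecture (RL & Architecture)

Keywords: discrete flow matching, model distillation, trajectory shaping, energy navigation, text generation, efficient inference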

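📋 Key Points

In existing discrete flow matching distillation, training trajectories are built from stochastic jumps with no evaluation of sequence quality, so early errors propagate through later steps and cap the student model's performance.
TS-DFM introduces an energy-navigation mechanism: during training, a lightweight energy model scores candidate intermediate states, and the most coherent candidates form the trajectory used for distillation.
Experiments show that the 8-step TS-DFM student reaches lower perplexity than the 1,024-step teacher, with no inference cost added by the shaping, and outperforms baselines trained on more data or using larger models.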

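📝 Abstract (Summary)

Discrete flow matching generates text by iteratively transforming noise into language, but it typically requires hundreds of forward passes. Existing distillation methods train a student to reproduce the long trajectory in a few steps, yet they are limited by the quality of the training trajectories themselves. The paper argues that these trajectories are built from blind stochastic jumps with no evaluation of sequence quality, so early errors accumulate in later steps. To address this, the authors propose Trajectory-Shaped Discrete Flow Matching (TS-DFM), which uses a lightweight energy compass during training to evaluate candidate continuations at each midpoint and select the most coherent one. The method adds overhead only at training time; inference cost is unchanged. On a 170M-parameter language model, the 8-step TS-DFM student achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale.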

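🔬 Method Details

Problem definition: Discrete flow matching (DFM) generates text through a long sequence of iterative denoising steps, and distillation aims to compress that long trajectory into a few steps. The pain point is that training trajectories are produced by "blind" stochastic jumps with no check on sequence quality along the way. The student is therefore forced to imitate low-quality intermediate states, which limits what distillation can achieve.

Core idea: The paper advances the view that the trajectory is the teacher, arguing that the bottleneck is the training trajectory rather than the student's capacity. An energy function acts as a "navigation compass" that filters candidate continuations during training, so the distillation target is a high-quality sequence path.

Technical framework: During training, TS-DFM runs a navigation loop. It generates several candidate intermediate states, scores their coherence with a pretrained energy model, selects the best candidate to extend the trajectory, and then updates the student to fit the shaped trajectory.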

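Key innovation: The central contribution is bringing energy navigation into the distillation of discrete flow matching. Unlike conventional distillation, the method reshapes the path at training time, improving the quality of the distillation target itself, and it adds no computation at inference.

Key design: A lightweight energy model serves as the evaluator, scoring sampled candidates at each intermediate state. With this training-time shaping, the student learns from more semantically coherent generation trajectories and, in very few steps, reaches lower perplexity than the teacher.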
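The navigation loop described under "Technical framework" can be illustrated with a small toy sketch. This is not the authors' implementation: the mask token, the `toy_energy` scorer, the `propose` jump, and all parameter names are hypothetical stand-ins chosen only to show the pattern of "sample several candidates at each midpoint, keep the lowest-energy one". A real system would use the teacher model to propose continuations and a learned energy model to score them.

```python
import random

MASK = -1                 # hypothetical noise/mask token
VOCAB = list(range(10))   # hypothetical toy vocabulary


def toy_energy(seq):
    """Hypothetical energy: number of adjacent unmasked pairs that differ.

    Lower energy stands in for "more coherent"; a real energy compass
    would be a learned model.
    """
    return sum(
        1
        for a, b in zip(seq, seq[1:])
        if a != MASK and b != MASK and a != b
    )


def propose(seq, n_reveal, rng):
    """Blind stochastic jump: fill up to n_reveal masked slots at random."""
    masked = [i for i, tok in enumerate(seq) if tok == MASK]
    chosen = rng.sample(masked, min(n_reveal, len(masked)))
    new_seq = list(seq)
    for i in chosen:
        new_seq[i] = rng.choice(VOCAB)
    return new_seq


def select_best(candidates):
    """Energy navigation: keep the candidate with the lowest energy."""
    return min(candidates, key=toy_energy)


def shaped_trajectory(length, n_midpoints, n_candidates, rng):
    """Build one training trajectory, choosing among candidates per midpoint."""
    state = [MASK] * length
    trajectory = [state]
    per_step = -(-length // n_midpoints)  # ceiling division
    for _ in range(n_midpoints):
        candidates = [propose(state, per_step, rng) for _ in range(n_candidates)]
        state = select_best(candidates)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    for step, state in enumerate(shaped_trajectory(8, 4, 5, rng)):
        print(step, state, "energy:", toy_energy(state))
```

Setting `n_candidates=1` recovers the unguided case, where each midpoint is a single blind jump. The selection happens only while building training trajectories, which is why this kind of shaping leaves the student's inference procedure untouched.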

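📊 Experimental Highlights

On a 170M-parameter model, TS-DFM needs only 8 inference steps to reach 32% lower perplexity than the 1,024-step teacher, while being 128x faster. TS-DFM achieves the best perplexity of any discrete-generation baseline in the comparison, including methods trained on 6x more data or using 5x larger models, which supports the claim that trajectory quality is the bottleneck in distillation.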
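The headline numbers can be checked with simple arithmetic. The sketch below assumes that generation cost scales with the number of forward passes and that teacher and student have the same per-step cost; that assumption is not stated in the summary, though it matches the reported 128x figure. The function names and the example teacher perplexity are hypothetical placeholders, not reported results.

```python
TEACHER_STEPS = 1024
STUDENT_STEPS = 8
PERPLEXITY_REDUCTION = 0.32  # "32% lower perplexity" from the abstract


def step_speedup(teacher_steps, student_steps):
    """Speedup if cost is proportional to the number of forward passes."""
    return teacher_steps / student_steps


def student_perplexity(teacher_perplexity, reduction=PERPLEXITY_REDUCTION):
    """Student perplexity implied by a relative reduction over the teacher."""
    return teacher_perplexity * (1.0 - reduction)


if __name__ == "__main__":
    print(step_speedup(TEACHER_STEPS, STUDENT_STEPS))
    # Placeholder teacher perplexity, used only to illustrate the ratio.
    print(student_perplexity(100.0))
```

In other words, a 32% reduction means the student's perplexity is 0.68 times the teacher's, whatever the teacher's absolute value is.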

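🎯 Application Scenarios

The technique targets efficient inference for language models and is best suited to latency-sensitive, real-time generation such as conversational assistants, live code completion, and text generation on edge devices. By cutting the number of inference steps while maintaining output quality, it reduces compute cost and offers a new optimization approach for deploying discrete-space generative models.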

📄 Abstract (Original)

Discrete flow matching generates text by iteratively transforming noise tokens into coherent language, but may require hundreds of forward passes. Distillation uses the multi-step trajectory to train a student to reproduce the process in a few steps. When the student underperforms, the usual explanation is insufficient capacity. We argue the opposite: the trajectory is the bottleneck, not the student. Each training trajectory is built through a chain of blind stochastic jumps with no evaluation of sequence quality; a single bad decision at an early midpoint propagates through subsequent steps, yet the student must imitate the result. Trajectory-Shaped Discrete Flow Matching (TS-DFM) replaces these blind jumps with guided navigation: a lightweight energy compass evaluates candidate continuations at each midpoint, selecting the most coherent. All shaping is training-only; inference cost is unchanged. On 170M-parameter language modeling, the shaped student at 8 steps achieves 32% lower perplexity than the 1,024-step teacher while being 128x faster, with gains consistent across source distributions and three evaluators of increasing scale. TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.
