Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

📄 arXiv: 2606.04396v1

Authors: Anant Khandelwal, Manish Gupta

Categories: cs.CL

Published: 2026-06-03

Comments: 19 pages, 10 figures, 7 tables


💡 One-Sentence Summary

Proposes CAPR, an algorithm that improves reinforcement learning for diffusion language models

🎯 Matched areas: Pillar 2: RL & Architecture; Pillar 9: Embodied Foundation Models

Keywords: diffusion language models, reinforcement learning, denoising trace, path refinement, compute efficiency, local supervision, PPO

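📋 Key Points

  1. Existing reinforcement learning methods for dLLMs make little use of the denoising trace, so the training signal is coarse and model performance suffers.
  2. CAPR summarizes the denoising trace into a compact path state and uses cached trajectory states to generate cheap sibling continuations, which enables efficient local supervision.
  3. Across several benchmarks, CAPR needs far less compute than conventional tree rollouts while setting a new state of the art.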

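📝 Abstract (summary)

Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel, which leaves a rich denoising trace. Existing dLLM reinforcement learning methods use this signal only weakly: flat rollouts are cheap but assign a single reward to the whole trajectory, while tree rollouts provide finer-grained training signals at a high compute cost. The paper proposes CAPR (Cached-Amortized Path Refinement), which summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. CAPR sets a new state of the art for RL-tuned dLLMs across multiple tasks.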

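🔬 Method Details

Problem definition: The paper addresses the weak use of the denoising trace in existing dLLM reinforcement learning methods, which leaves the training signal coarse and limits how well the model learns.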

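Core idea: CAPR exploits the structure of the denoising trace. It summarizes the trace into a compact path state and uses cached trajectory states to generate cheap sibling continuations, which enables efficient local block-level supervision.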
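The caching idea can be illustrated with a toy sketch. This is not the paper's implementation: the token lists stand in for whatever state a real dLLM would cache, and `unmask_block`, `rollout_with_cache`, `sibling_from_cache`, and the random `sample_token` callback are hypothetical names introduced only for illustration. The point it shows is that a sibling continuation resumes from a saved partial state, so the blocks before the branch point are not decoded again.

```python
import random
from typing import Callable, List, Optional, Tuple

Token = Optional[int]  # None marks a still-masked position (toy convention)


def unmask_block(seq: List[Token], block: int, block_size: int,
                 sample_token: Callable[[int], int]) -> List[Token]:
    """Return a copy of seq with every position of one block filled in."""
    out = list(seq)
    start = block * block_size
    for i in range(start, start + block_size):
        out[i] = sample_token(i)
    return out


def rollout_with_cache(num_blocks: int, block_size: int,
                       sample_token: Callable[[int], int]
                       ) -> Tuple[List[Token], List[List[Token]]]:
    """Decode block by block, saving the partial sequence at each boundary."""
    seq: List[Token] = [None] * (num_blocks * block_size)
    cache = [list(seq)]  # cache[b] is the state before block b is decoded
    for b in range(num_blocks):
        seq = unmask_block(seq, b, block_size, sample_token)
        cache.append(list(seq))
    return seq, cache


def sibling_from_cache(cache: List[List[Token]], branch_block: int,
                       num_blocks: int, block_size: int,
                       sample_token: Callable[[int], int]) -> List[Token]:
    """Resume from a cached state; only blocks >= branch_block are redone."""
    seq = list(cache[branch_block])
    for b in range(branch_block, num_blocks):
        seq = unmask_block(seq, b, block_size, sample_token)
    return seq


# Toy usage: 4 blocks of 2 tokens, branching a sibling at block 2.
rng = random.Random(0)
sample = lambda position: rng.randrange(10)  # stand-in for a model's sampler
base, cache = rollout_with_cache(4, 2, sample)
sibling = sibling_from_cache(cache, 2, 4, 2, sample)
```

In this toy setup the sibling shares the first two blocks with the base rollout and pays only for the last two, which is the kind of saving the cached continuations are meant to provide.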

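Technical framework: CAPR's pipeline consists of path-state recording, block-progress feature extraction, and redistribution of the final reward. It first records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block.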
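A minimal sketch of the redistribution step follows, assuming the simplest reading of "according to the tokens revealed in each block": each block receives a share of the outcome reward proportional to its revealed-token count. The proportional rule, the even-split fallback, and the function name `redistribute_reward` are assumptions for illustration; the paper may use a different weighting.

```python
from typing import List


def redistribute_reward(final_reward: float,
                        tokens_revealed: List[int]) -> List[float]:
    """Split one outcome reward across blocks by revealed-token share."""
    total = sum(tokens_revealed)
    if total == 0:
        # Assumed fallback: with no revealed tokens, split the reward evenly.
        n = len(tokens_revealed)
        return [final_reward / n] * n if n else []
    return [final_reward * count / total for count in tokens_revealed]


# Toy usage: three blocks that revealed 8, 16, and 8 tokens.
block_rewards = redistribute_reward(1.0, [8, 16, 8])
```

Under this rule the per-block targets always sum to the original outcome reward, so the value head is trained on a decomposition of the same signal rather than on extra reward.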

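Key innovation: CAPR draws tree-like supervision from the denoising trace itself without full tree expansion, which significantly reduces the compute cost of rollout generation.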

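Key design: CAPR uses a block-wise unmasking schedule, trains a block-level value head for local supervision, and redistributes the outcome reward so that one sparse reward is converted into block-level PPO weights.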
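To make "block-level PPO weights" concrete, here is a sketch that uses the standard clipped PPO surrogate with one advantage per block. It assumes each block's advantage is its redistributed reward minus a value prediction for that block; the summary does not give CAPR's exact objective, so the advantage definition, the function names, and the hand-written value predictions are all illustrative assumptions.

```python
import math
from typing import List


def block_advantages(block_targets: List[float],
                     block_values: List[float]) -> List[float]:
    """Advantage per block: redistributed reward minus predicted value."""
    return [t - v for t, v in zip(block_targets, block_values)]


def clipped_ppo_loss(logp_new: List[float], logp_old: List[float],
                     advantages: List[float], clip_eps: float = 0.2) -> float:
    """Standard clipped PPO surrogate, averaged over blocks."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # new-policy / old-policy probability
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        # PPO maximizes the pessimistic (smaller) term, so the loss negates it.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


# Toy usage: two blocks; the value numbers stand in for a value head's output.
advantages = block_advantages([0.75, 0.25], [0.25, 0.5])
loss = clipped_ppo_loss([-1.0, -2.0], [-1.0, -2.0], advantages)
```

Blocks whose redistributed reward exceeds the predicted value get a positive weight and are reinforced, while blocks that fall short are pushed down, even though only one outcome reward was observed for the whole trajectory.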

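📊 Experimental Highlights

On 4x4 Sudoku, Countdown, GSM8K, and Math500, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.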

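🎯 Application Scenarios

CAPR performs well across multiple tasks and has broad application potential, especially in natural language processing tasks that need efficient reinforcement learning. Its lower compute cost and improved performance make it valuable in practice and could support the training and deployment of more complex models.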

📄 Abstract (original)

Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.