DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

📄 arXiv: 2605.21467v1

Authors: Kaiyi Zhang, Wei Wu, Yankai Lin

Categories: cs.LG, cs.CL

Published: 2026-05-20


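💡 One-sentence takeaway

Proposes DelTA to address the poorly understood link between response-level rewards and token-level probability changes.

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: reinforcement learning, verifiable rewards, token credit assignment, model optimization, natural language processing, code generation, machine learning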

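📋 Key points

  1. Existing RLVR methods leave the relationship between response-level rewards and token-level probability changes poorly understood, which limits how well they learn.
  2. The paper proposes DelTA, which uses discriminative token credit assignment to amplify side-specific token-gradient directions and make better use of the reward signal.
  3. On seven mathematical benchmarks, DelTA beats the strongest same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 on Qwen3-14B-Base.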

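📝 Abstract (translated from Chinese)

Reinforcement learning from verifiable rewards (RLVR) has become a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains unclear. The paper introduces a discriminator view of RLVR updates, showing that the policy-gradient update direction acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Because centroid construction in standard sequence-level RLVR can be dominated by shared high-frequency patterns, the paper proposes DelTA, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. Experiments show that DelTA outperforms the strongest same-scale baselines on seven mathematical benchmarks.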
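One way to make the discriminator view concrete is a first-order expansion of a single gradient step. The notation below ($g_{i,t}$, $A_i$, $T_i$, $\eta$, $w$, $c^{\pm}$, $Z^{\pm}$) is chosen here for illustration and is not taken from the paper, whose exact normalization may differ.

```latex
% Token-gradient vector of token t in response i (notation assumed for illustration)
g_{i,t} = \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})

% Sequence-level policy-gradient direction with response-level advantage A_i,
% split into a positive-side and a negative-side centroid
w = \sum_i A_i \sum_{t=1}^{T_i} g_{i,t} = Z^{+} c^{+} - Z^{-} c^{-},
\qquad
c^{\pm} = \frac{1}{Z^{\pm}} \sum_{i:\, \pm A_i > 0} |A_i| \sum_{t=1}^{T_i} g_{i,t},
\qquad
Z^{\pm} = \sum_{i:\, \pm A_i > 0} |A_i| \, T_i

% First-order effect of the step \theta \leftarrow \theta + \eta w on any token (j, s)
\Delta \log \pi_\theta(y_{j,s} \mid x, y_{j,<s}) \approx \eta \, \langle w, g_{j,s} \rangle
```

Under this sketch, the sign of $\langle w, g_{j,s} \rangle$ decides whether a token's log-probability goes up or down, which is the sense in which $w$ behaves as a linear discriminator over token-gradient vectors.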

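🔬 Method details

Problem definition: The paper asks how response-level rewards affect token-level probability changes. Existing methods tend to dilute important information when shared high-frequency patterns are present.

Core idea: Taking a discriminator view of the update, DelTA estimates token coefficients that amplify side-specific token-gradient directions, so that the reward signal is used more effectively.

Technical framework: DelTA consists of computing token gradients, estimating discriminative coefficients, and reweighting a self-normalized RLVR surrogate, which yields more contrastive centroids.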
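The centroid construction that DelTA starts from can be sketched on toy data. The snippet below is an illustration only, not the paper's code: the three-dimensional "token-gradient vectors", the advantages of +1 and -1, and all names are made up for this example. Both sides carry the same total weight here, so the centroid difference is proportional to the update direction.

```python
# Toy illustration only: 3-dimensional "token-gradient vectors" invented for this sketch.
# dim 0 = shared formatting direction, dim 1 = seen only in the high-reward response,
# dim 2 = seen only in the low-reward response.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_centroid(vectors, weights):
    """Self-normalized weighted average of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]

# Each entry: (token-gradient vector, advantage of the response the token came from).
tokens = [
    ([1.0, 0.0, 0.0], 1.0), ([1.0, 0.0, 0.0], 1.0), ([1.0, 0.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0], 1.0),   # sparse token specific to the high-reward response
    ([1.0, 0.0, 0.0], -1.0), ([1.0, 0.0, 0.0], -1.0), ([1.0, 0.0, 0.0], -1.0),
    ([0.0, 0.0, 1.0], -1.0),  # sparse token specific to the low-reward response
]

# Split by the sign of the advantage; weights are the advantage magnitudes.
pos = [(g, a) for g, a in tokens if a > 0]
neg = [(g, -a) for g, a in tokens if a < 0]

c_plus = weighted_centroid([g for g, _ in pos], [w for _, w in pos])
c_minus = weighted_centroid([g for g, _ in neg], [w for _, w in neg])

# Discriminator direction: difference of the two side-wise centroids.
w_dir = [p - m for p, m in zip(c_plus, c_minus)]

# The sign of each score says whether that token's probability is pushed up or down.
scores = [dot(w_dir, g) for g, _ in tokens]

print("c_plus :", c_plus)
print("c_minus:", c_minus)
print("w_dir  :", w_dir)
print("scores :", scores)
```

In this toy case three quarters of each centroid sits on the shared formatting direction, while the side-specific component is only 0.25. That is the dilution effect the abstract describes.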

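Key innovation: DelTA's core innovation is discriminative token credit assignment, which changes how much each token gradient contributes to the update. Compared with standard sequence-level RLVR, this separates high-reward responses from low-reward ones more effectively.

Key design: The side-wise centroids are advantage-weighted averages of token-gradient vectors, as in standard sequence-level RLVR. DelTA's estimated token coefficients reweight the self-normalized surrogate so that the effective centroids become more contrastive, and an adaptive loss function is designed to optimize the estimation of the token coefficients.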
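The effect of reweighting on the centroids can be sketched as follows. The source does not say how DelTA estimates its token coefficients, so `hypothetical_coefficient` below is a stand-in heuristic invented for this example (one minus the cosine similarity to the opposite side's centroid), not the paper's estimator. The toy vectors and all other names are likewise assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def weighted_centroid(vectors, weights):
    """Self-normalized weighted average of equal-length vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]

# Toy token-gradient vectors: dim 0 is a shared formatting direction,
# dims 1 and 2 are specific to the positive and negative side respectively.
pos_grads = [[1.0, 0.0, 0.0]] * 3 + [[0.0, 1.0, 0.0]]
neg_grads = [[1.0, 0.0, 0.0]] * 3 + [[0.0, 0.0, 1.0]]
pos_w = [1.0] * 4  # advantage magnitudes on the positive side
neg_w = [1.0] * 4  # advantage magnitudes on the negative side

# Baseline centroids: plain advantage-weighted averages.
c_plus = weighted_centroid(pos_grads, pos_w)
c_minus = weighted_centroid(neg_grads, neg_w)

def hypothetical_coefficient(g, other_centroid):
    """HYPOTHETICAL heuristic (not DelTA's estimator): tokens that look like
    the opposite side's centroid get a small coefficient."""
    return max(0.0, 1.0 - cosine(g, other_centroid))

pos_coef = [hypothetical_coefficient(g, c_minus) for g in pos_grads]
neg_coef = [hypothetical_coefficient(g, c_plus) for g in neg_grads]

# Reweighted, still self-normalized, side-wise centroids.
c_plus_rw = weighted_centroid(pos_grads, [w * c for w, c in zip(pos_w, pos_coef)])
c_minus_rw = weighted_centroid(neg_grads, [w * c for w, c in zip(neg_w, neg_coef)])

print("cosine before:", cosine(c_plus, c_minus))
print("cosine after :", cosine(c_plus_rw, c_minus_rw))
```

With this stand-in, the shared formatting tokens receive small coefficients and the side-specific tokens keep a coefficient of 1, so the two reweighted centroids overlap far less than the original ones.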

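📊 Experimental highlights

On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 average points on Qwen3-8B-Base and 2.62 on Qwen3-14B-Base. The original abstract also reports additional results on code generation, a different backbone, and out-of-domain evaluations.

🎯 Application scenarios

DelTA's results may carry over to natural language processing, code generation, and related areas. By making better use of the reward signal at the token level, it could help in building more capable dialogue systems and automated programming tools.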

📄 Abstract (original)

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose DelTA, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.