DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Authors: Kaiyi Zhang, Wei Wu, Yankai Lin

Categories: cs.LG, cs.CL

Published: 2026-05-20

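💡 One-sentence summary

Proposes DelTA to address the poorly understood link between response-level rewards and token-level probability changes.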

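🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: reinforcement learning, verifiable rewards, token credit assignment, model optimization, natural language processing, code generation, machine learning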

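📋 Key points

Existing RLVR methods leave the relationship between response-level rewards and token-level probability changes poorly understood, which limits how well the reward signal is used.
The paper proposes DelTA, a discriminative token credit assignment method that amplifies side-specific token-gradient directions and so uses the reward signal more efficiently.
On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively.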

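📝 Abstract (summary)

Reinforcement learning from verifiable rewards (RLVR) has become a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains unclear. The paper introduces a discriminator view of RLVR updates, showing that the policy-gradient update direction acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. In standard sequence-level RLVR, centroid construction can be dominated by shared high-frequency patterns. To address this, the paper proposes DelTA, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. Experiments show that DelTA outperforms the strongest same-scale baselines on seven mathematical benchmarks.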
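
The discriminator view can be written out with a first-order argument. What follows is a sketch under assumptions not stated in this summary: a plain sequence-level policy-gradient surrogate, token-gradient vectors $v_{i,t} = \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})$, response-level advantages $A_i$, response lengths $T_i$, and a step size $\eta$. The paper's exact normalization may differ.

$$
g \;=\; \sum_i A_i \sum_{t=1}^{T_i} v_{i,t} \;=\; W^{+} c^{+} - W^{-} c^{-},
\qquad
c^{\pm} = \frac{1}{W^{\pm}} \sum_{i:\, \pm A_i > 0} |A_i| \sum_{t=1}^{T_i} v_{i,t},
\qquad
W^{\pm} = \sum_{i:\, \pm A_i > 0} |A_i|\, T_i
$$

$$
\log \pi_{\theta + \eta g}(y_k \mid \cdot) - \log \pi_{\theta}(y_k \mid \cdot) \;\approx\; \eta \, \langle g, v_k \rangle
$$

Under these assumptions, $c^{+}$ and $c^{-}$ are the advantage-weighted averages of token-gradient vectors on the positive and negative sides, and the sign of $\langle g, v_k \rangle$ decides whether the probability of token $k$ goes up or down to first order.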

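🔬 Method details

Problem definition: The paper asks how response-level rewards affect token-level probability changes. When shared high-frequency patterns dominate, existing methods tend to dilute the sparse directions that carry the discriminative information.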
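
The dilution effect can be illustrated with a toy example. The sketch below is not taken from the paper: the 3-dimensional "token-gradient" vectors, the 9-to-1 ratio of shared to side-specific tokens, and all variable names are assumptions chosen only to show how a shared direction can dominate both side-wise centroids.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-d token-gradient space (illustration only):
#   axis 0: shared formatting direction
#   axis 1: direction specific to high-reward responses
#   axis 2: direction specific to low-reward responses
shared = np.array([1.0, 0.0, 0.0])
pos_specific = np.array([0.0, 1.0, 0.0])
neg_specific = np.array([0.0, 0.0, 1.0])

# Each side holds 9 shared formatting tokens and 1 side-specific token.
pos_tokens = np.vstack([shared] * 9 + [pos_specific])
neg_tokens = np.vstack([shared] * 9 + [neg_specific])

# With equal advantage magnitude per token, the advantage-weighted
# average reduces to a plain mean over each side's tokens.
c_pos = pos_tokens.mean(axis=0)
c_neg = neg_tokens.mean(axis=0)

centroid_similarity = cosine(c_pos, c_neg)
shared_fraction = abs(c_pos[0]) / np.linalg.norm(c_pos)
print(f"cosine(c_pos, c_neg) = {centroid_similarity:.3f}")
print(f"share of c_pos norm along shared axis = {shared_fraction:.3f}")
```

In this toy setting the two centroids are nearly parallel (cosine about 0.99), so almost all of each centroid lies along the shared direction and the side-specific components are small by comparison.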

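Core idea: Taking the discriminator view, DelTA estimates token coefficients that amplify side-specific token-gradient directions, making the reward signal more effective.

Technical framework: DelTA consists of three parts: computing token gradients, estimating discriminative coefficients, and using those coefficients to reweight a self-normalized RLVR surrogate, which yields more contrastive side-wise centroids.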
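
The three stages above can be sketched on toy data. This is a minimal illustration, not the paper's estimator: `toy_coefficients` is a hypothetical stand-in that scores tokens by cosine alignment with the difference of side means, and the `floor` value, the vectors, and all names are assumptions.

```python
import numpy as np

def weighted_centroid(vectors, weights):
    """Self-normalized weighted mean of the rows of `vectors`."""
    weights = np.asarray(weights, dtype=float)
    return weights @ vectors / weights.sum()

def toy_coefficients(own, other, floor=0.05):
    """Hypothetical stand-in for a token-coefficient estimator.

    Scores each token gradient in `own` by its cosine alignment with the
    direction pointing from the other side's mean to its own side's mean.
    Shared directions score near zero and are clipped up to `floor`.
    Assumes the two means differ and no token gradient is all zeros.
    """
    direction = own.mean(axis=0) - other.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    unit = own / np.linalg.norm(own, axis=1, keepdims=True)
    return np.maximum(unit @ direction, floor)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stage 1: token gradients (toy values: 9 shared tokens + 1 specific token).
shared = np.array([1.0, 0.0, 0.0])
pos = np.vstack([shared] * 9 + [np.array([0.0, 1.0, 0.0])])
neg = np.vstack([shared] * 9 + [np.array([0.0, 0.0, 1.0])])

# Stage 2: estimate per-token coefficients on each side.
coef_pos = toy_coefficients(pos, neg)
coef_neg = toy_coefficients(neg, pos)

# Stage 3: reweight the self-normalized side-wise centroids.
before = cosine(pos.mean(axis=0), neg.mean(axis=0))
after = cosine(weighted_centroid(pos, coef_pos),
               weighted_centroid(neg, coef_neg))
print(f"centroid cosine before: {before:.3f}, after: {after:.3f}")
```

With this stand-in, the side-specific token receives a larger coefficient than the shared tokens, and the reweighted centroids are less similar to each other than the unweighted ones, which is the qualitative effect the summary attributes to DelTA.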

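Key innovation: Through discriminative token credit assignment, DelTA redefines how much each token gradient contributes to the update. Compared with standard sequence-level RLVR, it separates high-reward responses from low-reward ones more effectively.

Key design: The side-wise centroids are formed by advantage-weighted averaging of token-gradient vectors, and an adaptive loss function is designed to optimize the estimation of the token coefficients.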

📊 Experimental highlights

On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively.

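🎯 Application scenarios

DelTA has potential applications in natural language processing and code generation. By making models more sensitive to the reward signal, it could support the development of more capable dialogue systems and automated programming tools.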

📄 Abstract (original)

Reinforcement learning from verifiable rewards (RLVR) has emerged as a central technique for improving the reasoning capabilities of large language models. Despite its effectiveness, how response-level rewards translate into token-level probability changes remains poorly understood. We introduce a discriminator view of RLVR updates, showing that the policy-gradient update direction implicitly acts as a linear discriminator over token-gradient vectors and thereby determines which token probabilities are increased or decreased during learning. Under standard sequence-level RLVR, this discriminator is constructed from positive- and negative-side centroids formed by advantage-weighted averaging of token-gradient vectors. However, such centroid construction can be dominated by shared high-frequency patterns, such as formatting tokens, diluting sparse yet discriminative directions that better distinguish high-reward responses from low-reward ones. To address this limitation, we propose $\textbf{DelTA}$, a discriminative token credit assignment method that estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones. These coefficients reweight a self-normalized RLVR surrogate, making the effective side-wise centroids more contrastive and thereby reshaping the RLVR update direction. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively. Additional results on code generation, a different backbone, and out-of-domain evaluations further demonstrate the generalization ability of DelTA.
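
The claim that the update direction acts as a linear discriminator over token-gradient vectors can be checked numerically on a very small policy. The sketch below assumes a single softmax over four tokens parameterized directly by its logits; the sampled tokens, advantages, and step size are hypothetical values, and the setup is far simpler than an actual language model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy policy: one softmax over 4 tokens, parameterized by its logits.
theta = np.array([0.5, -0.2, 0.1, 0.0])
pi = softmax(theta)

# For this policy, grad_theta log pi(k) = e_k - pi (row k of the matrix).
token_grads = np.eye(4) - pi

# Hypothetical sampled tokens and their advantages.
sampled = np.array([0, 2, 2, 3])
advantages = np.array([1.0, -0.5, -0.5, 0.0])

# Policy-gradient direction: advantage-weighted sum of token gradients.
g = advantages @ token_grads[sampled]

# Linear discriminator score <g, v_k> for every token in the vocabulary.
scores = token_grads @ g

# Compare with the actual change in log-probability after a small step.
eta = 1e-3
actual = np.log(softmax(theta + eta * g)) - np.log(pi)
predicted = eta * scores
print(np.sign(scores), np.allclose(actual, predicted, atol=1e-5))
```

In this toy case the sign of each score matches the direction in which that token's log-probability moves after the step: the token with positive advantage gains probability and the token with negative advantage loses it.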
