Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
Authors: Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang
Categories: cs.LG, cs.AI, cs.CL
Published: 2026-06-08
Comments: 9 pages, 6 figures, 2 tables (17 pages including references and appendices)
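💡 One-Sentence Takeaway
Proposes Reasoning Arena, which recovers a training signal when verifiable rewards are identical across a group of sampled traces.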
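🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: verifiable rewards, reinforcement learning, reasoning ability, trace comparison, Bradley-Terry model, adaptive training, dynamic updates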
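📋 Key Points
- Existing verifiable-reward methods often become uninformative at the group level: when every sampled trace receives the same reward, they provide no useful gradient signal.
- The paper proposes Reasoning Arena, which builds trace tournaments that compare reasoning traces head-to-head, so a useful gradient signal is still available when outcome rewards cannot separate the traces.
- Experiments show that Reasoning Arena outperforms the RLVR baseline by 7.6% on average on mathematics and coding benchmarks and accelerates training by 27% to 41%.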
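📝 Abstract (Translated Summary)
Reinforcement learning with verifiable rewards (RLVR) has become an important paradigm for improving the reasoning ability of large language models. However, when all sampled traces receive the same reward, group-relative advantage estimation provides no gradient signal. To address this, the paper proposes Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of simply discarding them. Reasoning Arena builds trace tournaments that compare reasoning traces to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. Experiments show that the method outperforms the RLVR baseline by 7.6% on average on mathematics and coding benchmarks and accelerates training by 27% to 41%.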
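🔬 Method Details
Problem definition: The paper addresses the case in RLVR where all traces sampled for a prompt receive the same reward, so no useful gradient signal is available. As a result, differences in reasoning quality among those traces can be neither distinguished nor exploited.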
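The vanishing signal can be illustrated with a minimal sketch. It assumes a GRPO-style estimator that centers each reward on the group mean and divides by the group standard deviation; the source only says "group-relative advantage estimation", so the exact normalization and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center each reward on the group mean and scale by the group std.

    Assumption: GRPO-style normalization; eps only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group: correct traces get positive advantage, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Non-diverse groups: every advantage is exactly zero, so the policy gradient vanishes.
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Both the all-correct and the all-wrong group yield zero advantages, even though the traces inside each group may differ in reasoning quality.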
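Core idea: Reasoning Arena builds trace tournaments that compare reasoning traces and expose finer-grained preferences within a group, converting reasoning quality into relative reward signals. The goal is to keep training effective when outcome rewards alone are uninformative.
Technical framework: The overall architecture consists of three main modules: trace generation, trace comparison, and reward estimation. Reasoning traces are generated first; groups whose traces all receive the same verifiable reward are then compared against a dynamically updated anchor pool; finally, a Bradley-Terry model is fitted to estimate rewards.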
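The routing step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `is_non_diverse`, `route_group`, and `toy_judge` are hypothetical names, and the toy judge (which simply prefers shorter traces) is a placeholder for the paper's judge system.

```python
def is_non_diverse(rewards, tol=1e-8):
    """True when every trace in the group received (numerically) the same reward."""
    return max(rewards) - min(rewards) <= tol


def route_group(traces, verifiable_rewards, judge_fn):
    """Return (rewards, source).

    Groups with identical verifiable rewards are sent to the judge instead of
    being discarded; all other groups keep their verifiable rewards.
    """
    if is_non_diverse(verifiable_rewards):
        return judge_fn(traces), "judge"
    return list(verifiable_rewards), "verifier"


def toy_judge(traces):
    # Placeholder only (assumption): scores shorter traces higher.
    return [-float(len(t)) for t in traces]


# Both traces are correct, so the verifier cannot separate them: route to the judge.
rewards_a, source_a = route_group(
    ["short proof", "a much longer, rambling proof"], [1.0, 1.0], toy_judge
)

# The verifier already separates these traces: keep the verifiable rewards.
rewards_b, source_b = route_group(["right", "wrong"], [1.0, 0.0], toy_judge)
```

Keeping the verifier path untouched means the judge is only invoked for groups that would otherwise contribute zero advantage.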
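Key innovation: The main technical novelty is obtaining relative reward signals from trace tournaments rather than from outcome rewards alone. Because each new trace is compared only against a small anchor pool rather than against every other trace, the method avoids the quadratic cost of exhaustive pairwise comparison.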
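To make the saving concrete, here is a rough sketch that counts judge calls under an anchor-pool schedule. The pool update rule is an assumption: the source says only that the pool is small and dynamically updated, so a FIFO pool of the most recent traces stands in for it, and the function name is hypothetical.

```python
from collections import deque


def anchor_comparison_pairs(num_traces, pool_size):
    """List (new_trace, anchor) index pairs when each trace meets only the current pool."""
    pool = deque(maxlen=pool_size)  # assumption: FIFO pool of the most recent traces
    pairs = []
    for t in range(num_traces):
        pairs.extend((t, a) for a in pool)  # compare the new trace with each anchor
        pool.append(t)                      # the new trace then becomes an anchor
    return pairs


n, k = 16, 3
pairs = anchor_comparison_pairs(n, k)
exhaustive = n * (n - 1) // 2
print(len(pairs), exhaustive)  # prints "42 120"
```

With 16 traces and a pool of 3 anchors, the schedule needs 42 comparisons instead of 120, and the count grows linearly in the number of traces rather than quadratically.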
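Key design: A dynamically updated anchor pool keeps the number of comparisons small, and a Bradley-Terry model fitted on the resulting incomplete comparison graph turns the outcomes into rewards, enabling scalable integration with RL.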
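As a minimal sketch of the reward-estimation step, a Bradley-Terry model can be fitted to a sparse set of win/loss outcomes by gradient ascent on the log-likelihood. The optimizer, the L2 penalty (`l2`), the learning rate, and the step count are assumptions chosen for a small self-contained example; the source does not say how the model is fitted.

```python
import math


def fit_bradley_terry(num_traces, comparisons, l2=0.1, lr=0.1, steps=2000):
    """Fit log-strengths theta with P(i beats j) = sigmoid(theta_i - theta_j).

    comparisons: list of (winner, loser) index pairs; many pairs may be missing.
    The L2 penalty (assumption) keeps strengths finite for traces that never win.
    """
    theta = [0.0] * num_traces
    for _ in range(steps):
        grad = [-l2 * t for t in theta]
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(theta[winner] - theta[loser])))
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


# Incomplete graph: traces 0 and 1 are never compared with trace 3.
comparisons = [(0, 1), (0, 2), (1, 2), (2, 3)]
scores = fit_bradley_terry(4, comparisons)
ranking = sorted(range(4), key=lambda i: scores[i], reverse=True)
```

On this toy graph the fitted scores order the traces 0, 1, 2, 3: trace 3 is placed below traces 0 and 1 through its loss to trace 2, without any direct comparison. The fitted scores could then serve as relative rewards within the group.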
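📊 Experimental Highlights
Reasoning Arena outperforms the RLVR baseline by 7.6% on average on mathematics and coding benchmarks. The method also accelerates training by 27% to 41%, saves nearly 50% of generation compute, and substantially improves overall reasoning performance.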
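🎯 Application Scenarios
Potential application areas include natural language processing, question-answering systems, and educational technology. By improving model reasoning, Reasoning Arena could be useful in tasks that require high-quality reasoning, and it may support the development of more capable interactive systems and automated decision-support tools.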
📄 Abstract (Original)
Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.