Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification
Authors: Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng
Categories: cs.LG, cs.AI
Published: 2026-06-02
🔗 Code/Project: GITHUB
💡 One-Sentence Summary
Proposes TTRL-CoCoV to address Pass@k optimization in label-free reinforcement learning.
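🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: test-time reinforcement learning, label-free learning, confidence-conditioned mechanism, generation coverage, diversity collapse, pseudo-label selection, reasoning improvement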
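📋 Key Points
- Existing work rarely studies Pass@k optimization in the label-free setting, which leaves generation coverage insufficient and hampers exploration.
- TTRL-CoCoV uses a confidence-conditioned mechanism that applies a different strategy to samples at each confidence level, improving both generation ability and diversity.
- Experiments show that TTRL-CoCoV improves performance on multiple benchmarks and, on several reasoning benchmarks, exceeds fully supervised RL methods by up to +5.0% absolute Pass@1.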
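📝 Abstract (translated from Chinese)
Test-time reinforcement learning (TTRL) is an emerging paradigm that strengthens the complex reasoning abilities of large language models without any labels. Existing studies focus mainly on Pass@1 performance, while optimizing Pass@k in the label-free setting remains an important and under-explored problem. Through in-depth empirical analysis, the paper finds that pseudo-label estimates for low-confidence samples are error-prone, while candidate answers for high-confidence samples suffer from severe diversity collapse. To address both issues, the paper proposes TTRL-CoCoV, a confidence-adaptive framework that aims to expand Pass@k coverage and improve Pass@1 performance. Experiments show that TTRL-CoCoV outperforms the best competing methods on six widely recognized benchmarks, with average absolute gains over TTRL of +9.8% in Pass@1 and +18.7% in Pass@16.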
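For readers unfamiliar with the metric, the sketch below shows the standard unbiased Pass@k estimator that is widely used for code and reasoning benchmarks. Whether the paper computes Pass@1 and Pass@16 with exactly this estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct answer.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k wrong samples: every size-k subset holds a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 4 correct answers out of 16 samples.
    print(pass_at_k(16, 4, 1))   # 0.25
    print(pass_at_k(16, 4, 16))  # 1.0
```

With 4 correct answers among 16 samples, Pass@1 is 0.25 while Pass@16 is 1.0, which illustrates why Pass@k rewards coverage of the answer space rather than agreement on a single answer.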
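🔬 Method Details
Problem definition: The paper targets the challenge of optimizing Pass@k in the label-free setting. Existing methods tend to produce incorrect pseudo-labels for low-confidence samples, while high-confidence samples suffer from diversity collapse.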
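The two failure modes can be made concrete with a minimal sketch. It assumes, as in majority-vote style test-time RL, that the pseudo-label is the most frequent sampled answer, and it treats the agreement ratio as the confidence signal. The summary does not state the paper's exact confidence definition, so both function names and the confidence measure here are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def pseudo_label_with_confidence(answers: Sequence[str]) -> Tuple[str, float]:
    """Return (majority answer, share of samples that agree with it)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def answer_diversity(answers: Sequence[str]) -> float:
    """Fraction of distinct answers; values near 1/len(answers) signal collapse."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return len(set(answers)) / len(answers)


if __name__ == "__main__":
    samples = ["42", "42", "7", "42"]
    print(pseudo_label_with_confidence(samples))  # ('42', 0.75)
    print(answer_diversity(samples))              # 0.5
```

Under this reading, a low agreement ratio means the majority answer is a weak pseudo-label, while a very high ratio means the samples contain few distinct candidates, which limits Pass@k.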
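Core idea: TTRL-CoCoV adjusts its training strategy through a confidence-conditioned mechanism. For high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse. For low-confidence samples, it delegates pseudo-label selection to the verifier to filter out incorrect pseudo-labels. For medium-confidence samples, it skips verification entirely.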
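The three-way routing described above can be sketched as follows. The thresholds `low=0.3` and `high=0.8`, the `Regime` enum, and the use of the majority agreement ratio as confidence are all assumptions made for illustration; the summary does not give the paper's actual values or definitions.

```python
from collections import Counter
from enum import Enum
from typing import Sequence


class Regime(Enum):
    LOW = "verifier selects the pseudo-label"
    MEDIUM = "no verification"
    HIGH = "bootstrap verifier + exploration-enhancing reward"


def route(answers: Sequence[str], low: float = 0.3, high: float = 0.8) -> Regime:
    """Assign a prompt's sampled answers to a confidence regime.

    Confidence is approximated by the share of samples that agree with
    the most frequent answer (an assumption, not the paper's definition).
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    top_votes = Counter(answers).most_common(1)[0][1]
    confidence = top_votes / len(answers)
    if confidence < low:
        return Regime.LOW
    if confidence >= high:
        return Regime.HIGH
    return Regime.MEDIUM


if __name__ == "__main__":
    print(route(["a"] * 9 + ["b"]))           # Regime.HIGH
    print(route(["a", "a", "b", "c"]))        # Regime.MEDIUM
    print(route(["a", "b", "c", "d", "e"]))   # Regime.LOW
```

Keeping the routing decision separate from the reward computation makes it easy to swap in a different confidence measure without touching the rest of the training loop.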
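Technical framework: TTRL-CoCoV consists of three main modules: a verifier and reward mechanism for high-confidence samples, direct generation for medium-confidence samples, and pseudo-label selection for low-confidence samples. Together, these modules improve generation coverage and accuracy.
Key innovation: The novelty of TTRL-CoCoV is its confidence-conditioned mechanism, which lets the model adapt its strategy to each sample's confidence level. According to the paper, this design improves both diversity and accuracy.
Key design: The reward function is adjusted dynamically according to sample confidence. This preserves diversity on high-confidence samples, while the verifier is brought in on low-confidence samples to improve pseudo-label accuracy.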
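As a rough illustration of the low-confidence branch, the sketch below lets a verifier choose the pseudo-label among the distinct candidate answers and then assigns a binary match reward. The `verifier_score` callable is a hypothetical stand-in (in practice it might be the model itself prompted to judge an answer), and the binary reward is an assumption borrowed from majority-vote style test-time RL, not a confirmed detail of TTRL-CoCoV.

```python
from collections import Counter
from typing import Callable, List, Sequence


def select_pseudo_label(
    answers: Sequence[str],
    verifier_score: Callable[[str], float],
) -> str:
    """Pick the distinct candidate the verifier scores highest.

    Ties on verifier score are broken in favor of the more frequent answer.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    return max(counts, key=lambda a: (verifier_score(a), counts[a]))


def binary_rewards(answers: Sequence[str], pseudo_label: str) -> List[float]:
    """Reward 1.0 for samples matching the pseudo-label, else 0.0."""
    return [1.0 if a == pseudo_label else 0.0 for a in answers]


if __name__ == "__main__":
    samples = ["3", "5", "5", "7"]
    # Hypothetical verifier scores; a real verifier would be a model call.
    toy_scores = {"3": 0.9, "5": 0.4, "7": 0.1}
    label = select_pseudo_label(samples, toy_scores.__getitem__)
    print(label)                           # 3
    print(binary_rewards(samples, label))  # [1.0, 0.0, 0.0, 0.0]
```

In this toy case the verifier overrides the majority answer "5" and selects "3", which is the situation the low-confidence branch is meant to handle: the majority vote is unreliable, so selection is delegated to verification.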
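📊 Experimental Highlights
TTRL-CoCoV outperforms the best competing methods on six benchmarks, with average absolute gains over TTRL of +9.8% in Pass@1 and +18.7% in Pass@16. On multiple reasoning benchmarks, it also achieves absolute Pass@1 improvements of up to +5.0% over fully supervised reinforcement learning methods.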
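🎯 Application Scenarios
Potential application areas include natural language processing, dialogue systems, and automated content generation. By improving a model's reasoning ability in label-free settings, TTRL-CoCoV could help such systems handle complex tasks where labeled data is unavailable.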
📄 Abstract (Original)
Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Although existing studies focus on Pass@1 performance, optimizing Pass@k, which measures generation coverage for sustained exploration, remains under-explored yet critical in label-free settings. Optimizing Pass@k in the label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.