Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

Authors: Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng

Categories: cs.LG, cs.AI

Published: 2026-06-02

🔗 Code/Project: https://github.com/shanjf666/CoCoV

💡 One-Sentence Takeaway

Proposes TTRL-CoCoV to address Pass@k optimization in label-free reinforcement learning.

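🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: test-time reinforcement learning, label-free learning, confidence-conditioned mechanism, generation coverage, diversity collapse, pseudo-label selection, reasoning improvement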

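📋 Key Points

Few existing methods optimize Pass@k in label-free settings, which leaves generation coverage limited and hampers exploration.
TTRL-CoCoV uses a confidence-conditioned mechanism that applies a different strategy to samples at each confidence level, improving both generation capability and diversity.
Experiments show that TTRL-CoCoV improves performance across multiple benchmarks; on several reasoning benchmarks it also improves Pass@1 by up to +5.0% absolute over fully supervised RL methods.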

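📝 Summary (translated from Chinese)

Test-time reinforcement learning (TTRL) is an emerging paradigm that strengthens the complex reasoning abilities of large language models without any labels. Existing work focuses mainly on Pass@1 performance, while optimizing Pass@k in label-free settings remains an important and under-explored problem. Through empirical analysis, the paper finds that pseudo-label estimates for low-confidence samples are often incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To address both issues, it proposes TTRL-CoCoV, a confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Experiments show that TTRL-CoCoV outperforms the best competing methods on six widely recognized benchmarks, with average absolute gains over TTRL of +9.8% in Pass@1 and +18.7% in Pass@16.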

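🔬 Method Details

Problem definition: The paper targets Pass@k optimization in label-free settings. Existing methods tend to produce incorrect pseudo-labels for low-confidence samples, while high-confidence samples suffer from diversity collapse.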
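For readers unfamiliar with the metric, the sketch below shows one common way to estimate Pass@k: the standard unbiased estimator, computed from n sampled answers of which c are correct. This summary does not say which estimator the paper uses, so treat the choice and the function name `pass_at_k` as assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    the summarized paper may compute the metric differently.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # C(n-c, k) / C(n, k) is the probability that a random size-k subset
    # of the n samples contains no correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 4 samples, 1 correct: Pass@1 equals the raw accuracy.
    print(pass_at_k(4, 1, 1))
    # 4 samples, 2 correct: drawing 2 misses only if both are wrong.
    print(pass_at_k(4, 2, 2))
```

With k = 1 the estimate reduces to plain accuracy c / n, which is why Pass@1 and Pass@k can move in different directions when answer diversity changes.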

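Core idea: TTRL-CoCoV builds on the observation that verification capability generally leads generation capability, and adapts its strategy to each sample's confidence. For high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter out incorrect pseudo-labels; for medium-confidence samples, it bypasses verification entirely.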
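The three-way split can be sketched as a small routing function. This is a minimal illustration, not the authors' implementation: it assumes confidence is the share of sampled answers that agree with the majority answer, and the two thresholds are hypothetical values, since this summary gives neither the confidence definition nor the cut-offs.

```python
from collections import Counter
from typing import Sequence, Tuple

# Hypothetical thresholds; the summary does not state the paper's values.
LOW_THRESHOLD = 0.3
HIGH_THRESHOLD = 0.8


def majority_confidence(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it.

    Assumption: majority-vote agreement stands in for "confidence".
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def route(answers: Sequence[str]) -> str:
    """Map a prompt's sampled answers to the 'high', 'medium', or 'low' regime."""
    _, confidence = majority_confidence(answers)
    if confidence >= HIGH_THRESHOLD:
        return "high"    # bootstrap verifier + exploration-enhancing reward
    if confidence < LOW_THRESHOLD:
        return "low"     # verifier selects the pseudo-label
    return "medium"      # bypass verification


if __name__ == "__main__":
    print(route(["7"] * 9 + ["3"]))
    print(route(["7", "7", "3", "4"]))
    print(route(["1", "2", "3", "4"]))
```

The comments on each branch only name the behavior described in the text; the reward and verifier logic for each regime is not implemented here.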

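Technical framework: The overall architecture has three branches: verifier bootstrapping with an exploration-enhancing reward for high-confidence samples, verification-free training for medium-confidence samples, and verifier-based pseudo-label selection for low-confidence samples. Together, these branches improve generation coverage and accuracy.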
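To make the low-confidence branch concrete, the sketch below picks a pseudo-label by verifier score instead of by vote count. The `verifier_score` callable is hypothetical: this summary does not describe how the verifier is queried or how its outputs are aggregated, so a simple "highest score wins" rule is assumed.

```python
from typing import Callable, Sequence


def select_pseudo_label(
    answers: Sequence[str],
    verifier_score: Callable[[str], float],
) -> str:
    """Choose the candidate answer that the verifier scores highest.

    Assumption: `verifier_score` is a hypothetical function returning a
    higher value for answers the verifier judges more likely correct.
    """
    if not answers:
        raise ValueError("answers must be non-empty")
    # Deduplicate while keeping first-seen order, so ties resolve predictably.
    distinct = list(dict.fromkeys(answers))
    return max(distinct, key=verifier_score)


if __name__ == "__main__":
    # Toy scores standing in for a real verifier.
    toy_scores = {"3": 0.2, "7": 0.9, "5": 0.4}
    sampled = ["3", "3", "7", "5"]
    # Majority voting would pick "3"; the toy verifier prefers "7".
    print(select_pseudo_label(sampled, toy_scores.__getitem__))
```

The toy example shows the intended effect: when agreement among samples is weak, the most frequent answer need not be the one the verifier trusts most.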

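Key innovation: TTRL-CoCoV introduces a confidence-conditioned mechanism that lets the method adjust its strategy dynamically according to sample confidence. According to the authors, this improves both the diversity and the accuracy of generated answers.

Key design: The reward is adjusted according to sample confidence. It preserves diversity for high-confidence samples, while the verifier is brought in for low-confidence samples to improve pseudo-label accuracy.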

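📊 Experimental Highlights

TTRL-CoCoV outperforms the best competing methods on six benchmarks, with average absolute gains over TTRL of +9.8% in Pass@1 and +18.7% in Pass@16. On several reasoning benchmarks, it also achieves absolute Pass@1 improvements of up to +5.0% over fully supervised RL methods.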

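🎯 Application Scenarios

Potential application areas include natural language processing, dialogue systems, and automated content generation. By improving reasoning without labels, TTRL-CoCoV may help models handle complex tasks in settings where labeled data is unavailable.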

📄 Abstract (Original)

Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Despite existing studies focusing on Pass@1 performance, optimizing Pass@k remains under-explored yet critical in label-free settings, which measures generation coverage for sustained exploration. Optimizing Pass@k in label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely-recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.
