Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

📄 arXiv: 2605.06650v1 📥 PDF

Authors: Mingwei Xu, Hao Fang

Categories: cs.CL

Published: 2026-05-07


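💡 One-Sentence Takeaway

Proposes POPO, a positive-only policy optimization framework that improves the reasoning ability of large language models without using negative rollouts.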

🎯 Matched areas: Pillar 2: RL Algorithms and Architecture; Pillar 9: Embodied Foundation Models

Keywords: reinforcement learning, large language models, reasoning ability, policy optimization, verifiable rewards, importance sampling

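📋 Key Points

  1. Existing RLVR methods such as GRPO contrast positive and negative rollouts, but under sparse binary rewards a negative rollout carries no gradation of failure severity, and a handful of sampled negatives covers only a tiny part of the combinatorially large failure space.
  2. POPO applies bounded importance sampling to positive rollouts only; redistributing probability mass toward them gives rise to implicit negative gradients, so no explicit negative rollouts are needed for gradient guidance.
  3. On mathematical reasoning benchmarks POPO performs comparably to or better than GRPO; with Qwen-Math-7B it reaches 36.67% on AIME 2025 versus 30.00% for GRPO.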

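📝 Abstract (translated)

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning ability of large language models. GRPO simplifies advantage estimation by grouping positive and negative rollouts, but negative rollouts often admit no gradation of failure severity, and under sparse binary rewards a few sampled negatives are unlikely to cover a meaningful reward signal. This paper proposes Positive-Only Policy Optimization (POPO), an RLVR framework that learns exclusively from online positive rollouts. POPO applies bounded importance sampling over the positive rollout set; implicit negative gradients emerge naturally as probability is redistributed toward the positive rollouts, so no negative rollouts are required. To stabilize policy optimization, POPO also introduces a siamese policy network with momentum-based updates and a bounded similarity penalty in the siamese representation space that replaces the conventional KL divergence. Experiments on models such as Qwen-Math-7B show that POPO matches or outperforms GRPO on mathematical benchmarks, reaching 36.67% accuracy on AIME 2025.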

🔬 Method Details

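Problem definition: The paper targets the weak learning signal that negative rollouts provide in RLVR. Existing methods such as GRPO contrast positive and negative rollouts, but under sparse binary rewards every failure receives the same reward regardless of severity, and the space of possible failures is so large that penalizing a few sampled negatives is unlikely to cover a meaningful part of it.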
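The "no gradation of failure severity" point can be made concrete with a small sketch of group-relative advantages under a binary reward. This is an illustration, not the paper's code: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` term are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, six rollouts, binary verifier reward (1 = correct, 0 = wrong).
rewards = [1, 0, 0, 0, 1, 0]
advantages = group_relative_advantages(rewards)

# Collect the distinct advantage values assigned to the failed rollouts.
negative_advantages = {round(a, 6) for a, r in zip(advantages, rewards) if r == 0}
```

Every failed rollout ends up with the same negative advantage, whether it was a near miss or entirely wrong, so the penalty cannot distinguish between kinds of failure.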

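Core idea: POPO learns from positive rollouts only. Reinforcing the probability of positive rollouts redistributes probability mass away from all other outputs, since the policy's probabilities must sum to one; this redistribution acts as an implicit negative gradient, so the policy can improve without explicit negative rollouts.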
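A toy softmax example shows how raising the probability of one outcome pushes the others down. It uses the standard identity that the gradient of log softmax(z)[k] with respect to logit z[j] is 1[j = k] - p[j]. This is a single-step, four-outcome illustration of the general mechanism, not the paper's derivation; the logits, step size, and function names are made up for the example.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, k):
    """Gradient of log softmax(logits)[k] w.r.t. each logit: 1[j == k] - p_j."""
    probs = softmax(logits)
    return [(1.0 if j == k else 0.0) - p for j, p in enumerate(probs)]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits over four outcomes
positive = 0                    # index of the outcome the verifier accepted
grad = grad_log_prob(logits, positive)

# One gradient-ascent step on the log-probability of the positive outcome.
lr = 0.5
new_logits = [z + lr * g for z, g in zip(logits, grad)]
old_probs, new_probs = softmax(logits), softmax(new_logits)
```

Only the positive outcome is reinforced, yet every other logit receives a negative gradient of size p[j], and in this example every other probability falls after the step.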

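Technical framework: POPO optimizes the policy with bounded importance sampling over the set of online positive rollouts and stabilizes training with two mechanisms: a siamese policy network updated by a momentum-based adaptation law, which provides a stable reference, and a bounded similarity penalty in the siamese representation space, which replaces the conventional KL-divergence term.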
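A momentum-updated siamese network is commonly implemented as an exponential moving average (EMA) of the online parameters. The sketch below assumes a fixed momentum coefficient and represents parameters as a plain dictionary; the paper's "adaptation law" may differ (for example, by adapting the momentum over time), so `momentum_update` is an illustration only.

```python
def momentum_update(online, target, momentum=0.99):
    """Move each target parameter a small step toward its online counterpart."""
    return {
        name: momentum * target[name] + (1.0 - momentum) * online[name]
        for name in target
    }

# Hypothetical scalar "parameters" standing in for network weights.
online = {"w": 1.0, "b": -2.0}
target = {"w": 0.0, "b": 0.0}

# Three updates with momentum 0.9: the target closes 1 - 0.9**3 of the gap.
for _ in range(3):
    target = momentum_update(online, target, momentum=0.9)
```

Because the target changes slowly, it can serve as a stable reference even when the online policy moves quickly between updates.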

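Key innovation: The main contribution is showing that bounded importance sampling over positive rollouts alone gives rise to implicit negative gradients, so no disjoint set of negative rollouts is needed for gradient guidance. This removes the reliance on sampling and penalizing negatives.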
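One way to picture a positive-only objective with a bounded importance ratio is sketched below. The form of the bound is an assumption: this example simply caps the ratio from above with a hypothetical `ratio_cap`, works on sequence-level log-probabilities, and averages over accepted rollouts. The paper's actual objective is not given in this summary and may be different.

```python
import math

def positive_only_surrogate(logp_new, logp_old, rewards, ratio_cap=1.2):
    """Average capped importance ratio over rollouts the verifier accepted."""
    terms = []
    for new, old, r in zip(logp_new, logp_old, rewards):
        if r != 1:
            continue  # negative rollouts are dropped rather than penalized
        ratio = math.exp(new - old)         # importance ratio new / old
        terms.append(min(ratio, ratio_cap))  # assumed bound: cap from above
    return sum(terms) / len(terms) if terms else 0.0

# Hypothetical sequence log-probabilities for four rollouts of one prompt.
logp_old = [-3.0, -2.5, -4.0, -3.2]
logp_new = [-2.8, -2.0, -1.0, -3.3]
rewards = [1, 1, 0, 1]  # the third rollout failed verification

value = positive_only_surrogate(logp_new, logp_old, rewards)

# Changing the failed rollout's log-probability leaves the objective unchanged.
value_alt = positive_only_surrogate([-2.8, -2.0, -9.0, -3.3], logp_old, rewards)
```

The failed rollout contributes nothing directly; under this sketch, any pressure on it comes only through the redistribution of probability toward the positive rollouts.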

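Key design: The siamese network is updated with a momentum-based adaptation law to keep policy evolution stable, and a bounded similarity penalty constrains how far the policy drifts in the siamese representation space, which is intended to prevent policy collapse and over-optimization.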
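To illustrate what "bounded" buys over a KL term, the sketch below uses one minus cosine similarity between two representation vectors. The choice of cosine similarity is an assumption made for this example; the summary does not say which similarity measure POPO uses. Unlike KL divergence, which can grow without limit, this penalty always lies in the range 0 to 2.

```python
import math

def bounded_similarity_penalty(h_online, h_target, eps=1e-12):
    """Return 1 - cosine similarity: 0 when aligned, 2 when opposite."""
    dot = sum(a * b for a, b in zip(h_online, h_target))
    norm_online = math.sqrt(sum(a * a for a in h_online))
    norm_target = math.sqrt(sum(b * b for b in h_target))
    return 1.0 - dot / (norm_online * norm_target + eps)

# Hypothetical representation vectors from the online and siamese networks.
aligned = bounded_similarity_penalty([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = bounded_similarity_penalty([1.0, 0.0], [0.0, 1.0])
opposite = bounded_similarity_penalty([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

A bounded penalty limits how large the regularization gradient can become when the online policy drifts far from its reference, which is one plausible reason to prefer it for stability.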

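📊 Experimental Highlights

POPO performs comparably to or better than GRPO on mathematical reasoning benchmarks. With Qwen-Math-7B, POPO reaches 36.67% accuracy on AIME 2025, compared with 30.00% for the GRPO baseline. Ablation and sweep studies further support the necessity and robustness of POPO's components, including the siamese network and the bounded similarity penalty.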
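For a sense of scale, the reported percentages can be converted into problem counts. The sketch assumes AIME 2025 consists of 30 problems (AIME I and AIME II, 15 each) and that accuracy is measured once per problem; if the paper averages over multiple samples per problem, the counts below are only one consistent reading of the numbers.

```python
NUM_PROBLEMS = 30  # assumption: AIME I + AIME II, 15 problems each

def implied_solved(accuracy_percent, n=NUM_PROBLEMS):
    """Convert an accuracy percentage into an implied number of solved problems."""
    return accuracy_percent / 100 * n

popo_solved = implied_solved(36.67)  # close to 11 problems
grpo_solved = implied_solved(30.00)  # 9 problems

absolute_gain = 36.67 - 30.00           # percentage points
relative_gain = absolute_gain / 30.00   # fraction of the GRPO score
```

Under these assumptions the gap is about two problems out of 30: a gain of 6.67 percentage points, or roughly 22% relative to GRPO, measured on a small benchmark.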

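🎯 Application Scenarios

The method targets reinforcement learning for LLM reasoning in domains with verifiable rewards. The reported experiments cover mathematical benchmarks; code generation and logical reasoning are plausible further applications but are not evaluated in the abstract. Because POPO learns from positive rollouts only, it may also simplify the RL stage, although the abstract reports no compute-cost measurements.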

📄 Abstract (original)

Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community witnesses the rapid change from the Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), in which GRPO reduces the complicated advantage estimation with simple estimation over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and the combinatorial vastness makes penalizing a few sampled negatives unlikely to cover a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning can occur exclusively via online positive rollouts. Specifically, POPO utilizes bounded importance sampling over the positive rollout set. Thus, no disjoint negative rollouts are used for the gradient guidance. We show that implicit negative gradients can emerge naturally through reinforcing the positive probability via rollouts redistribution. Next, POPO stabilizes the policy optimization through two mechanisms. First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, we replace the KL-divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text-LLM models, e.g., the Qwen family, across all-level mathematical benchmarks. Our experiment demonstrates that POPO achieves performance comparable to, or even superior to GRPO. Notably, we show that POPO can achieve 36.67% in AIME 2025 with Qwen-Math-7B, outperforming GRPO 30.00%. Our ablation and sweep studies further illustrate the necessity and robustness of POPO components.