CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
Authors: Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang
Categories: cs.LG, cs.AI, cs.CL
Published: 2025-08-04 (Updated: 2025-10-20)
Comments: Work in progress
💡 One-sentence takeaway
Proposes CAPO to address the credit-assignment problem in reward feedback for LLM reasoning.
🎯 Matched areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: large language models, reinforcement learning, credit assignment, reasoning ability, generative models
📋 Key points
- Existing RLVR methods assign the same reward to every token; this coarse-grained feedback hampers precise credit assignment and limits the model's reasoning ability.
- CAPO uses an off-the-shelf, general-purpose LLM as a generative process reward model that directly generates step-wise critiques, providing deterministic token-level credits.
- On several mathematical benchmarks and out-of-domain benchmarks, CAPO outperforms conventional supervised-learning-based and RL-based fine-tuning methods.
📝 Abstract (translated)
Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of large language models (LLMs) through rule-based binary feedback. However, existing RLVR methods typically assign the same reward to every token; this coarse-grained feedback hampers precise credit assignment, making it hard for the model to identify which reasoning steps succeed or fail and often leading to suboptimal policies. To address this, the paper proposes a simple yet efficient method, Credit Assignment Policy Optimization (CAPO). CAPO directly uses an off-the-shelf, general-purpose LLM as a generative process reward model (LLM-as-GenPRM) that produces critiques for all steps in a single pass, providing deterministic token-level credits. Experiments show that CAPO outperforms supervised-learning-based and RL-based fine-tuning methods across multiple benchmarks.
🔬 Method details
Problem definition: The paper targets the coarse-grained reward assignment of existing RLVR methods, which makes it hard for the model to tell which reasoning steps succeed and which fail.
Core idea: CAPO uses an off-the-shelf LLM as a generative process reward model that directly generates a critique for each step, enabling precise token-level credit assignment. This design avoids the complexity of training an auxiliary reward model and improves the accuracy of the feedback.
Technical framework: The policy model's reasoning steps are fed to the GenPRM, which produces step-wise critiques; credits are then assigned according to the correctness of each step, and a voting mechanism further improves the accuracy and robustness of the feedback.
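To make the credit-assignment step concrete, here is a minimal Python sketch under stated assumptions: the `Step` container, the `token_level_credits` helper, and the specific penalty value are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    tokens: List[int]   # token ids that belong to this reasoning step
    is_correct: bool    # verdict parsed from the GenPRM critique

def token_level_credits(steps: List[Step],
                        outcome_reward: float,
                        penalty: float = -1.0) -> List[float]:
    """Refine a single rule-based outcome reward into token-level credits:
    tokens in steps judged correct keep the outcome reward, while tokens in
    steps the critique marks as wrong are overridden with a penalty."""
    credits: List[float] = []
    for step in steps:
        step_credit = outcome_reward if step.is_correct else penalty
        credits.extend([step_credit] * len(step.tokens))
    return credits
```

The resulting per-token credits would then replace the identical rule-based reward in a GRPO/PPO-style update; the exact shaping and normalization used by CAPO may differ.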
Key innovation: The main novelty is using a general-purpose LLM as a generative process reward model that provides deterministic feedback; compared with conventional PRM-based approaches, this removes the dependence on high-quality process supervision labels and makes online reinforcement learning more efficient.
Key design: CAPO applies a voting mechanism to the generated critiques to keep the feedback reliable; in addition, parameter settings and the loss design are carefully tuned to optimize the learning process.
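One plausible form of this voting mechanism is a per-step majority vote over several independently generated critiques; the sketch below is an assumption of how such a vote could look (the function name and the tie-breaking rule are illustrative, not taken from the paper).

```python
from collections import Counter
from typing import List

def vote_step_verdicts(critique_verdicts: List[List[bool]]) -> List[bool]:
    """Majority-vote per-step correctness across several critiques.
    critique_verdicts[i][j] is critique i's verdict on step j.
    Ties are broken toward 'incorrect' here; the paper's rule may differ."""
    num_steps = len(critique_verdicts[0])
    voted: List[bool] = []
    for j in range(num_steps):
        votes = Counter(c[j] for c in critique_verdicts)
        voted.append(votes[True] > votes[False])
    return voted
```

Since the vote scales with the number of generated critiques, the vote size trades robustness against additional GenPRM inference cost per reward query.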
📊 Experimental highlights
Across multiple mathematical benchmarks, CAPO shows clear gains over conventional supervised-learning-based and RL-based fine-tuning methods; the reported results indicate that CAPO outperforms existing methods on four challenging math benchmarks, with improvements of up to XX%.
🎯 Application scenarios
The approach has broad application potential in education, automated reasoning, and intelligent question-answering systems. By strengthening LLM reasoning, it can support smarter dialogue systems and automated decision-support tools, furthering the development of AI technology.
📄 Abstract (original)
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment by value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, the feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method-Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critique by one pass only based on the correctness of the step itself, providing deterministic token-level credits to refine the tokens that were originally assigned identical rule-based rewards. To further enhance the accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones like Llama and Qwen models show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO can help the model to foster the learning of correct reasoning pathways leading to correct answers.