Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

📄 arXiv: 2505.03318v3

Authors: Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, Jiaqi Wang

Category: cs.CV

Published: 2025-05-06 (updated: 2025-10-29)

Comments: [NeurIPS2025] Project Page: https://codegoat24.github.io/UnifiedReward/think


💡 One-Sentence Takeaway

Proposes UnifiedReward-Think, a unified multimodal chain-of-thought (CoT) reward model built through reinforcement fine-tuning.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, reward models, chain-of-thought, reinforcement learning, vision tasks

📋 Key Points

  1. Existing reward models lack reasoning depth in multimodal tasks, producing inaccurate reward signals that make it hard to align vision models effectively.
  2. Proposes UnifiedReward-Think, which introduces an explicit long chain-of-thought (CoT) process to improve the reliability and robustness of reward reasoning.
  3. Adopts an exploration-driven reinforcement fine-tuning approach comprising CoT distillation, large-scale reasoning elicitation, and GRPO-based reinforcement learning; experiments demonstrate the model's superior performance.

📝 Abstract (translated)

This paper proposes UnifiedReward-Think, a unified multimodal chain-of-thought (CoT) reward model that introduces an explicit long-chain reasoning process to strengthen the reliability and robustness of reward signals, thereby better aligning vision models with human preferences. Existing reward models typically provide only direct responses or shallow reasoning, leading to inaccurate reward signals. UnifiedReward-Think elicits the model's latent complex reasoning ability through an exploration-driven reinforcement fine-tuning approach: it first distills GPT-4o's reasoning process using a small amount of image-generation preference data, then leverages large-scale unified multimodal preference data to elicit the model's reasoning across diverse vision tasks, and finally applies Group Relative Policy Optimization (GRPO) reinforcement fine-tuning so the model can explore different reasoning paths and optimize for correct, robust solutions. Extensive experiments on a range of visual reward tasks demonstrate the model's superiority.

🔬 Method Details

Problem definition: Existing visual reward models usually emit a direct reward signal or perform only very shallow reasoning, and cannot carry out deep, multi-step thinking. The resulting reward signals are not accurate enough to effectively steer vision models toward human preferences. A reward model capable of long-chain reasoning is therefore needed to provide more reliable and robust reward signals.

Core idea: The core idea is to bring chain-of-thought (CoT) reasoning into the multimodal reward model, so that it can reason step by step across multiple dimensions, much as a human would, and thereby assess rewards for visual tasks more accurately. Reinforcement learning then encourages the model to explore different reasoning paths and converge on correct, robust solutions.
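To make the idea concrete, here is a minimal sketch of how a CoT reward model's pairwise judgment might be post-processed. The `<think>…</think>` trace format and the verdict phrasing are assumptions for illustration, not a format confirmed by the paper:

```python
import re

def parse_cot_judgment(output: str) -> str:
    """Extract the final preference verdict from a CoT-style reward model
    output that reasons inside <think>...</think> before answering.
    The tag and verdict formats are hypothetical conventions."""
    # Strip the explicit reasoning trace, keep only the final answer.
    answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()
    match = re.search(r"Image ([12]) is better", answer)
    if match is None:
        raise ValueError("no verdict found in model output")
    return f"Image {match.group(1)}"

sample = (
    "<think>Image 1 follows the prompt more faithfully; "
    "Image 2 shows artifacts in the background.</think>\n"
    "Image 1 is better"
)
print(parse_cot_judgment(sample))  # → Image 1
```

The point of the explicit trace is that the verdict is grounded in a multi-step evaluation rather than a single opaque score.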

Technical framework: UnifiedReward-Think comprises three stages. 1) CoT distillation: a small amount of image-generation preference data is used to distill the format and structure of CoT reasoning from GPT-4o, serving as the model's cold start. 2) Large-scale reasoning elicitation: large-scale unified multimodal preference data elicits the model's reasoning process across diverse vision tasks, and correct reasoning outputs are retained for subsequent rejection sampling. 3) GRPO reinforcement fine-tuning: incorrectly predicted samples are used for reinforcement fine-tuning with Group Relative Policy Optimization (GRPO), enabling the model to explore different reasoning paths and optimize for correct, robust solutions.

Key innovations: 1) the first unified multimodal CoT reward model, capable of multi-dimensional, step-by-step long-chain reasoning; 2) an exploration-driven reinforcement fine-tuning approach that effectively elicits the model's latent complex reasoning ability; 3) the combination of CoT reasoning with reinforcement learning, which lets the model explore different reasoning paths and optimize for more robust solutions.

Key design: In the CoT distillation stage, a small set of high-quality image-generation preference data ensures the model learns the correct CoT reasoning format. In the large-scale reasoning elicitation stage, a unified multimodal preference dataset covering diverse vision tasks is constructed. In the GRPO stage, the Group Relative Policy Optimization algorithm encourages the model to explore different reasoning paths and optimize for more robust solutions. Specific hyperparameter settings and network architecture details are not spelled out in the paper and remain unknown.
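The routing between stages 2 and 3 can be sketched as a simple partition of elicited outputs: correct verdicts feed the rejection-sampling refinement, incorrect ones go to GRPO. The tuple layout and field names below are hypothetical, chosen only to illustrate the data flow:

```python
def split_for_training(samples):
    """Partition elicited reasoning outputs: samples whose verdict matches
    the human label are retained for rejection-sampling refinement, the
    rest are routed to GRPO reinforcement fine-tuning.
    `samples` is a hypothetical list of (cot_output, predicted, label)."""
    retained, for_grpo = [], []
    for cot, pred, label in samples:
        (retained if pred == label else for_grpo).append(cot)
    return retained, for_grpo

# Hypothetical elicitation results over two preference pairs.
batch = [
    ("CoT A ... verdict: Image 1", "Image 1", "Image 1"),
    ("CoT B ... verdict: Image 2", "Image 2", "Image 1"),
]
retained, for_grpo = split_for_training(batch)
```

This split is what lets the hard cases (where the model's elicited reasoning fails) drive the exploration phase, while the easy cases consolidate the CoT format.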


📊 Experimental Highlights

Experiments show that UnifiedReward-Think achieves significant performance gains across a range of visual reward tasks. Specific numbers and comparison baselines are not given in the abstract and remain unknown, but the paper emphasizes the model's superiority in delivering more accurate and robust reward signals.

🎯 Application Scenarios

The results can be broadly applied to aligning and optimizing vision models in areas such as image generation, visual question answering, and object detection. By providing more accurate and robust reward signals, the approach can improve the consistency of vision models with human preferences, improving user experience and advancing AI applications in the visual domain.

📄 Abstract (original)

Recent advances in multimodal Reward Models (RMs) have shown significant promise in delivering reward signals to align vision models with human preferences. However, current RMs are generally restricted to providing direct responses or engaging in shallow reasoning processes with limited depth, often leading to inaccurate reward signals. We posit that incorporating explicit long chains of thought (CoT) into the reward reasoning process can significantly strengthen their reliability and robustness. Furthermore, we believe that once RMs internalize CoT reasoning, their direct response accuracy can also be improved through implicit reasoning capabilities. To this end, this paper proposes UnifiedReward-Think, the first unified multimodal CoT-based reward model, capable of multi-dimensional, step-by-step long-chain reasoning for both visual understanding and generation reward tasks. Specifically, we adopt an exploration-driven reinforcement fine-tuning approach to elicit and incentivize the model's latent complex reasoning ability: (1) We first use a small amount of image generation preference data to distill the reasoning process of GPT-4o, which is then used for the model's cold start to learn the format and structure of CoT reasoning. (2) Subsequently, by leveraging the model's prior knowledge and generalization capabilities, we prepare large-scale unified multimodal preference data to elicit the model's reasoning process across various vision tasks. During this phase, correct reasoning outputs are retained for rejection sampling to refine the model (3) while incorrect predicted samples are finally used for Group Relative Policy Optimization (GRPO) based reinforcement fine-tuning, enabling the model to explore diverse reasoning paths and optimize for correct and robust solutions. Extensive experiments across various vision reward tasks demonstrate the superiority of our model.