A Rolling Stone Gathers No Moss: Adaptive Policy Optimization for Stable Self-Evaluation in Large Multimodal Models

📄 arXiv: 2508.09155v1

Authors: Wenkai Wang, Hongcan Guo, Zheqi Lv, Shengyu Zhang

Categories: cs.LG, cs.AI

Published: 2025-08-05

Comments: 17 pages, 9 figures


💡 One-Sentence Takeaway

Proposes AdaPO, an online RL framework, to address the self-evaluation problem in Large Multimodal Models.

🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: self-evaluation, multimodal models, reinforcement learning, adaptive reward, dynamic regularization, dialogue systems, model optimization

📋 Key Points

  1. Existing approaches to self-evaluation in multimodal models rely on fixed reward mechanisms that invite reward hacking, undermining training stability.
  2. This paper proposes the AdaPO framework, which adaptively adjusts training objectives and combines an Adaptive Reward Model with a dynamic KL regularization mechanism to mitigate reward hacking.
  3. Extensive experiments across 8 benchmarks and multiple models show that AdaPO significantly improves both reasoning and self-evaluation capabilities.

📝 Abstract (Translated)

Self-evaluation is a key capability that allows Large Multimodal Models to self-improve in multi-turn conversations, yet it is largely absent from foundation models. Existing work uses reinforcement learning to strengthen self-evaluation, but a fixed reward mechanism is prone to reward hacking when multiple training objectives are optimized jointly, which can trigger model collapse. This paper proposes AdaPO, an online reinforcement learning framework that adaptively adjusts each task's training objective in real time according to the current training state. Specifically, AdaPO introduces an Adaptive Reward Model (ARM) and a Reward-Aware Dynamic KL regularization mechanism to mitigate reward hacking. Experiments show that the method significantly improves both direct reasoning and self-evaluation.

🔬 Method Details

Problem definition: The paper addresses the weakness of Large Multimodal Models in self-evaluation, in particular the reward hacking induced by a fixed reward mechanism, which undermines training stability and performance.

Core idea: The AdaPO framework adjusts training objectives adaptively in real time, using an Adaptive Reward Model (ARM) and a dynamic KL regularization mechanism to improve self-evaluation while avoiding the failure modes of fixed rewards.

Technical framework: AdaPO comprises two main modules: the Adaptive Reward Model (ARM), which assesses each task's training state, and the Reward-Aware Dynamic KL regularization mechanism, which dynamically adjusts the penalty coefficient; a rough illustration of how they combine into one objective is sketched below.
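
As a rough, hedged illustration only (the paper's exact formulation is not reproduced in this summary), the two modules plausibly combine into a per-task objective in which the usual fixed KL penalty is replaced by a reward-gap-dependent coefficient. Here s_task is the training state estimated by ARM, Δr is the reward gap between multi-turn situations, and β₀, f are assumed scaling choices:

```latex
\max_{\theta}\;
\mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, R_{\mathrm{ARM}}\!\left(\tau \mid s_{\mathrm{task}}\right) \right]
\;-\; \beta(\Delta r)\,
\mathrm{KL}\!\left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right),
\qquad
\beta(\Delta r) = \beta_0 \, f(\Delta r)
```

Under this reading, as a sub-task converges, the ARM-shaped reward and the coefficient β shift emphasis automatically, consistent with the abstract's claim that the learning focus adjusts smoothly without manual intervention.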

Key innovation: The central contribution is the combination of the Adaptive Reward Model and dynamic KL regularization, which lets the model shift its learning focus according to the state of different multi-turn situations instead of relying on the fixed penalty used in prior methods.

Key design: ARM evaluates the training state by analyzing the performance distribution of model-generated multi-turn trajectories, while the dynamic KL regularization replaces the fixed penalty with a dynamic coefficient modulated by the reward gap between different multi-turn situations.
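
For concreteness, here is a minimal Python sketch of the two mechanisms as described above; all function names, default values, and the modulation rule are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# ARM side (assumed form): estimate a task's training state from the
# distribution of rewards over model-generated multi-turn trajectories.
def estimate_task_state(trajectory_rewards):
    """Summarize per-trajectory rewards for one task into a progress estimate."""
    rewards = np.asarray(trajectory_rewards, dtype=float)
    return {
        "mean": rewards.mean(),  # how well the task is currently solved
        "std": rewards.std(),    # how unstable the rollouts still are
    }

# Reward-aware dynamic KL (assumed rule): replace the fixed penalty
# coefficient with one modulated by the reward gap between different
# multi-turn situations (e.g., first-turn answers vs. answers revised
# after self-evaluation).
def dynamic_kl_coef(reward_gap, beta_base=0.05, sensitivity=2.0):
    # Larger gap -> stronger pull toward the reference policy;
    # smaller gap -> lighter penalty.
    return beta_base * (1.0 + sensitivity * abs(reward_gap))

# Per-task emphasis (assumed): weight tasks that are far from converged
# more heavily, so learning focus shifts smoothly without manual scheduling.
def task_weight(state, target_reward=1.0):
    return max(target_reward - state["mean"], 0.0) + state["std"]

# Toy usage with synthetic rewards for one task:
state = estimate_task_state([0.2, 0.4, 0.3, 0.6])
print(state, task_weight(state), dynamic_kl_coef(reward_gap=0.35))
```

In the actual method, the ARM estimate would feed back into the reward assigned to each multi-turn situation; the sketch only captures the general shape of that feedback.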

📊 Experimental Highlights

Across 8 benchmarks, AdaPO significantly improves self-evaluation and lifts direct reasoning performance by more than 15%, while showing greater stability and adaptability than conventional fixed-reward approaches.

🎯 Application Scenarios

Potential applications include intelligent customer service, dialogue systems, and educational robots, where stronger self-evaluation and reasoning in complex conversational settings can improve the user interaction experience. Looking ahead, the approach could drive more capable dialogue systems and make human-computer interaction more natural and effective.

📄 Abstract (Original)

Self-evaluation, a model's ability to assess the correctness of its own output, is crucial for Large Multimodal Models (LMMs) to achieve self-improvement in multi-turn conversations, yet largely absent in foundation models. Recent work has employed reinforcement learning (RL) to enhance self-evaluation; however, its fixed reward mechanism suffers from reward hacking when optimizing multiple training objectives, leading to model collapse. In this paper we propose AdaPO, an online reinforcement learning framework capable of adaptively adjusting training objective in real time according to the current training state for each task. Specifically, to mitigate reward hacking, AdaPO introduces an Adaptive Reward Model (ARM) and a Reward Aware Dynamic KL Regularization mechanism. ARM assesses the task's training state from the distribution of model generated multi-turn trajectories' performance. Reward Aware Dynamic KL replaces a fixed penalty with dynamic coefficients which is modulated by the reward gap between different multi-turn situations. Notably, our method automatically and smoothly adjusts its learning focus based on sub-tasks' training progress without manual intervention. Extensive experiments over 8 benchmarks and various models show that our method significantly enhances both direct reasoning and self-evaluation capability. We will release our code to contribute to the community.