BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning
Authors: Shaokai Ye, Vasileios Saveris, Yihao Qian, Jiaming Hu, Elmira Amirloo, Peter Grasch
Categories: cs.CV, cs.AI
Published: 2026-05-08
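💡 One-Sentence Summary
Proposes the BalCapRL framework, which uses multi-objective reinforcement learning to improve the image captioning quality of multimodal large language models.
🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: multimodal large language models, image captioning, reinforcement learning, multi-objective optimization, reward modeling, computer vision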
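📋 Key Points
- Existing RL methods for image captioning face a trade-off: utility-oriented objectives tend to induce hallucination, while arena-style objectives lead to overly generic captions.
- BalCapRL jointly optimizes utility-aware correctness, reference coverage, and linguistic quality, giving a balanced multi-objective RL training setup.
- Experiments on LLaVA-1.5-7B and Qwen2.5-VL models show consistent gains across several evaluation metrics.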
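📝 Abstract (Summary)
Image captioning is a fundamental task in computer vision and has received considerable attention in the era of multimodal large language models (MLLMs). In pursuit of more detailed and accurate captions, reinforcement learning (RL) has been widely adopted. However, existing methods often trade off several dimensions of caption quality against each other: utility-oriented objectives can produce hallucinated or overlong captions, while arena-style objectives favor fluent but generic descriptions. To address this, the paper proposes the BalCapRL framework, which jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. By applying GDPO-style reward-decoupled normalization to continuous-valued rewards and introducing length-conditional reward masking, the method improves caption quality on LLaVA-1.5-7B and the Qwen2.5-VL model family, with large gains on DCScore, CaptionQA, and CapArena.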
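🔬 Method Details
Problem definition: Existing RL-based captioning methods typically optimize a narrow notion of caption quality, which induces a trade-off between utility and fluency. Captions tuned for downstream question answering can become noisy, hallucinated, or overlong, while captions tuned for arena-style preference tend to be fluent but generic.
Core idea: The paper proposes a balanced RL framework whose multi-objective reward combines utility-aware correctness, reference coverage, and linguistic quality, so that the model does not over-optimize any single dimension.
Technical framework: The framework builds on policy-gradient optimization, adds a normalization scheme for continuous-valued rewards, and combines it with length-conditional reward masking so that caption length is balanced against content quality.
Key innovation: The method applies GDPO-style reward-decoupled normalization to continuous-valued captioning rewards, which the authors report improves performance over vanilla GRPO, and it introduces length-conditional reward masking, which yields a length penalty better suited to captioning.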
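One way to picture length-conditional reward masking is as a gate on individual reward components rather than a flat per-token penalty. The sketch below is a hypothetical illustration, not the paper's definition: the function name `mask_rewards_by_length`, the token budget `max_tokens`, the component names, and the choice of which components are maskable are all assumptions made here.

```python
from typing import Dict, Set


def mask_rewards_by_length(
    rewards: Dict[str, float],
    num_tokens: int,
    max_tokens: int,
    maskable: Set[str],
) -> Dict[str, float]:
    """Return a copy of rewards with maskable components zeroed if too long.

    Assumption: a caption longer than max_tokens forfeits the components
    listed in maskable; the other components are left untouched.
    """
    if num_tokens <= max_tokens:
        return dict(rewards)
    return {
        name: 0.0 if name in maskable else value
        for name, value in rewards.items()
    }


# Hypothetical reward components and values, for illustration only.
caption_rewards = {"correctness": 0.8, "coverage": 0.6, "linguistic": 0.9}
short = mask_rewards_by_length(caption_rewards, 120, 256, {"coverage"})
long = mask_rewards_by_length(caption_rewards, 400, 256, {"coverage"})
print(short)  # within budget: rewards unchanged
print(long)   # over budget: the maskable component is zeroed
```

Under this assumed rule, an overlong caption cannot collect reward from a component that verbosity would otherwise inflate, while captions within the budget are scored as usual.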
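Key design: The core components are a weighted combination of the multi-objective rewards and a modification of GRPO (Group Relative Policy Optimization) in which each reward component is normalized independently, so that the objectives can be optimized together during training.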
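The difference between normalizing a summed reward and normalizing each component separately can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `EPS` constant, the weights, and the toy reward values are hypothetical, and any further batch-level normalization a full method might apply is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence

EPS = 1e-6  # assumed small constant to avoid division by zero


def group_normalize(values: Sequence[float]) -> List[float]:
    """Standardize one list of scalars within a rollout group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(rewards: List[List[float]], weights: List[float]) -> List[float]:
    """Sum-then-normalize: combine the components, then standardize the totals."""
    totals = [sum(w * r for w, r in zip(weights, row)) for row in rewards]
    return group_normalize(totals)


def decoupled_advantages(rewards: List[List[float]], weights: List[float]) -> List[float]:
    """Normalize-then-sum: standardize each component, then combine."""
    columns = list(zip(*rewards))  # one tuple per reward component
    normalized = [group_normalize(col) for col in columns]
    return [
        sum(w * comp[i] for w, comp in zip(weights, normalized))
        for i in range(len(rewards))
    ]


# Toy group of two captions with two reward components on different scales.
rewards = [
    [1.0, 0.50],  # caption A: high on component 0, slightly lower on component 1
    [0.0, 0.52],  # caption B: low on component 0, slightly higher on component 1
]
weights = [1.0, 1.0]
print(grpo_advantages(rewards, weights))
print(decoupled_advantages(rewards, weights))
```

In this toy group the first component has a much larger spread, so under sum-then-normalize it alone decides which caption gets the positive advantage. After per-component normalization both components carry equal weight, and the two captions' advantages nearly cancel.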
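📊 Experimental Highlights
Experiments cover LLaVA-1.5-7B and Qwen2.5-VL (3B and 7B) base models, and BalCapRL improves caption quality consistently across them. Peak gains across the different models are +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena, supporting the framework's effectiveness at balancing multiple dimensions of caption quality.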
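🎯 Application Scenarios
The work is relevant to automatic image annotation, multimodal content retrieval, assistive vision systems, and perception modules for intelligent robots. By improving the accuracy and naturalness of MLLM-generated captions, the method could improve the reliability of downstream visual question answering (VQA) systems and support multimodal interaction systems that are better aligned with human preferences.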
📄 Abstract (Original)
Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that improve downstream question answering while harming fluency, whereas arena-style objectives can favor fluent but generic descriptions with limited usefulness. To address this, we propose a more balanced RL framework that jointly optimizes utility-aware correctness, reference coverage, and linguistic quality. In order to effectively optimize the resulting continuous multi-objective reward formulation, we apply GDPO-style reward-decoupled normalization to continuous-valued captioning rewards and show that it improves performance over vanilla GRPO. Additionally, we introduce length-conditional reward masking, yielding a more suitable length penalty for captioning. Across LLaVA-1.5-7B and Qwen2.5-VL 3B and 7B base models, our method consistently improves caption quality, with peak gains of +13.6 DCScore, +9.0 CaptionQA, and +29.0 CapArena across different models.