Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

📄 arXiv: 2606.11025v1

Authors: Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, Tianyu Pang

Categories: cs.LG

Published: 2026-06-09

🔗 Code/Project: https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO


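💡 One-Sentence Summary

Proposes Flow-DPPO, which replaces PPO-style ratio clipping with a divergence proximal constraint for policy optimization of flow matching models.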

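🎯 Matched Area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: flow matching models, reinforcement learning, policy optimization, KL divergence, multi-objective optimization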

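📋 Key Points

  1. Existing methods such as Flow-GRPO and CPS rely on ratio clipping, which is ill-suited to policy optimization for flow matching models and yields inaccurate estimates of policy divergence.
  2. The paper proposes Flow-DPPO, which replaces ratio clipping with a divergence proximal constraint and exploits the Gaussian form of the flow model's per-step policy to compute the KL divergence exactly.
  3. Experiments show that Flow-DPPO achieves notably higher rewards and better KL-proximal efficiency, alleviates catastrophic forgetting, and supports stable multi-epoch training.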

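📝 Abstract (Translated from Chinese)

Recent work has shown that online reinforcement learning can substantially improve the quality and alignment of flow matching models for image and video generation. Existing methods such as Flow-GRPO and CPS treat the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. The paper argues, however, that ratio clipping is structurally ill-suited to flow models: the probability ratio between the new and old policies is a noisy, single-sample estimate of the true policy divergence, which over-constrains some regions of the trajectory and under-constrains others. To address this, the authors propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. The key observation is that the per-step policy in a flow model is Gaussian, so the KL divergence between the old and new policies can be computed exactly and cheaply. Experiments show that Flow-DPPO achieves better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades.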

🔬 Method Details

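Problem definition: the paper targets a weakness of policy optimization for flow matching models. The ratio clipping used by existing methods over-constrains some regions of the trajectory and under-constrains others, because the probability ratio it relies on is an inaccurate estimate of the policy divergence.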
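To make the "noisy single-sample estimate" concrete, here is a minimal sketch using only the Python standard library. It assumes a one-dimensional Gaussian policy whose old and new versions share the same standard deviation; the means, the standard deviation, and the function name `log_ratio` are illustrative choices, not values or identifiers from the paper.

```python
import math
import random

def log_ratio(x, mu_old, mu_new, sigma):
    """log pi_new(x) - log pi_old(x) for 1-D Gaussians sharing the same sigma."""
    return ((x - mu_old) ** 2 - (x - mu_new) ** 2) / (2 * sigma ** 2)

# Illustrative (assumed) policy parameters.
mu_old, mu_new, sigma = 0.0, 0.1, 1.0

# Exact KL(pi_old || pi_new) when both Gaussians share sigma.
kl = (mu_old - mu_new) ** 2 / (2 * sigma ** 2)

# Draw actions from the old policy, as an on-policy rollout would.
rng = random.Random(0)
samples = [rng.gauss(mu_old, sigma) for _ in range(10000)]
ratios = [log_ratio(x, mu_old, mu_new, sigma) for x in samples]

mean_lr = sum(ratios) / len(ratios)
std_lr = math.sqrt(sum((r - mean_lr) ** 2 for r in ratios) / len(ratios))

print("exact KL:", kl)
print("mean log-ratio (approximately -KL):", mean_lr)
print("std of single-sample log-ratio:", std_lr)
```

In this toy setting the per-sample log-ratio spreads far more widely than the size of the divergence itself, so a clip applied to individual ratios can trigger or stay silent regardless of how far the policy has actually moved.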

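Core idea: replace ratio clipping with a divergence proximal constraint. Because the per-step policy in a flow model is Gaussian, the KL divergence between the old and new policies can be computed exactly, which gives more effective control over policy updates.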
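The exact KL computation can be sketched with the standard closed form for Gaussians. The version below assumes diagonal covariances, given as per-dimension standard deviations; the summary only states that the per-step policy is Gaussian, so the diagonal structure and the function name `diag_gaussian_kl` are assumptions made for illustration.

```python
import math

def diag_gaussian_kl(mu_old, sigma_old, mu_new, sigma_new):
    """KL( N(mu_old, diag(sigma_old^2)) || N(mu_new, diag(sigma_new^2)) ).

    All arguments are equal-length sequences of floats; sigmas are
    per-dimension standard deviations and must be positive.
    """
    kl = 0.0
    for m_o, s_o, m_n, s_n in zip(mu_old, sigma_old, mu_new, sigma_new):
        kl += (
            math.log(s_n / s_o)
            + (s_o ** 2 + (m_o - m_n) ** 2) / (2 * s_n ** 2)
            - 0.5
        )
    return kl

# Equal standard deviations: the KL reduces to ||mu_old - mu_new||^2 / (2 sigma^2).
example_kl = diag_gaussian_kl([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0])
print(example_kl)
```

No sampling is involved: the divergence depends only on the two sets of distribution parameters, which is why it is cheap to evaluate at every denoising step.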

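Overall framework: the architecture comprises a policy update module and a divergence constraint module. The policy update module computes the KL divergence between the old and new policies, while the divergence constraint module ensures that updates do not exceed the configured divergence threshold.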

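Key innovation: the main technical contribution is the divergence proximal constraint, which addresses the limitations of ratio clipping in flow models and keeps policy updates stable and effective.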

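Key design: each per-step policy is modeled as a Gaussian distribution, a divergence threshold is set, and an asymmetric divergence mask blocks gradient updates only when an update simultaneously moves away from the trusted region and violates the divergence threshold.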
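One possible reading of the asymmetric mask is sketched below. The summary does not spell out how "moving away from the trusted region" is detected; this sketch assumes it means that the sign of the advantage would push the probability ratio further from 1 (positive advantage with a ratio already above 1, or negative advantage with a ratio already below 1). The function name `divergence_mask`, its parameters, and the threshold value in the usage lines are hypothetical.

```python
def divergence_mask(advantage, log_ratio, kl, kl_threshold):
    """Return 0.0 to block the gradient for this step, 1.0 to keep it.

    Assumption (not confirmed by the source): an update "moves away" when the
    advantage sign would push the new/old probability ratio further from 1.
    """
    moving_away = (advantage > 0 and log_ratio > 0) or (
        advantage < 0 and log_ratio < 0
    )
    violates = kl > kl_threshold
    return 0.0 if (moving_away and violates) else 1.0

# Hypothetical threshold, chosen only for illustration.
KL_THRESHOLD = 0.01

blocked = divergence_mask(advantage=1.0, log_ratio=0.2, kl=0.05, kl_threshold=KL_THRESHOLD)
returning = divergence_mask(advantage=1.0, log_ratio=-0.2, kl=0.05, kl_threshold=KL_THRESHOLD)
within = divergence_mask(advantage=1.0, log_ratio=0.2, kl=0.005, kl_threshold=KL_THRESHOLD)
print(blocked, returning, within)
```

The asymmetry is that an update which would pull the policy back toward the old one is never blocked, even when the divergence already exceeds the threshold.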

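📊 Experimental Highlights

Experiments show that Flow-DPPO achieves notably higher rewards and better KL-proximal efficiency, alleviates catastrophic forgetting, and is more stable in multi-epoch training. Specifically, relative to baseline methods, Flow-DPPO improves reward by more than 20% on multiple tasks.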

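🎯 Application Scenarios

Potential applications include image and video generation and the improvement of reinforcement learning algorithms. By making policy optimization for flow matching models more efficient, Flow-DPPO can deliver higher quality and consistency in generation tasks, and may influence future work in computer vision and multimedia processing.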

📄 Abstract (Original)

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.