Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

📄 arXiv: 2606.05950v1 📥 PDF

Authors: Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan

Categories: cs.AI

Published: 2026-06-04


💡 One-Sentence Summary

Proposes Edit-R2 to address context preservation in multi-turn image editing.

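🎯 Matched Areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture) Pillar 9: Embodied Foundation Models

Keywords: multi-turn image editing, reinforcement learning, context preservation, multimodal models, intent reconstruction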

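📋 Key Points

  1. Most existing methods are limited to single-turn editing and cannot reliably preserve context across a sequence of user instructions.
  2. Edit-R2 reconstructs the operative session intent, consolidating scattered historical constraints into an explicit reasoning trace, and supports multi-turn reinforcement learning.
  3. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and is competitive with strong baselines.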

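📝 Abstract (Summary)

Text-guided image editing has advanced rapidly, driven by diffusion models and unified multimodal foundation models. Most existing methods, however, remain confined to the single-turn setting and overlook multi-turn in-context editing, where a user iteratively refines an image through a sequence of instructions. This paper proposes Edit-R2, a reinforcement learning post-training framework that consolidates historical constraints and optimizes the multi-turn editing process. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and is competitive with strong baselines.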

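🔬 Method Details

Problem definition: The paper targets context preservation in multi-turn image editing. Existing methods suffer from two coupled failure modes: long-context dilution, where sparse textual constraints become hard to recover from a growing interleaved image-text history, and state contamination, where earlier editing mistakes degrade later generations.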

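Core idea: Before each editing turn, Edit-R2 reconstructs the operative session intent, consolidating scattered historical constraints into an explicit reasoning trace, so that each turn follows the new instruction while preserving the effects of earlier edits.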
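To make "consolidating scattered constraints into an explicit trace" concrete, here is a minimal toy sketch. It is not the paper's method: in Edit-R2 the model itself generates the reasoning trace, whereas this example uses a hypothetical rule-based `SessionIntent` class, and the `target`/`instruction` representation is an assumption made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SessionIntent:
    """Toy stand-in for a consolidated session intent (hypothetical structure)."""

    # Maps a constraint target (e.g. "sky") to the latest instruction about it.
    constraints: dict[str, str] = field(default_factory=dict)

    def update(self, target: str, instruction: str) -> None:
        # A newer instruction on the same target overrides the older one;
        # instructions on other targets remain in force.
        self.constraints[target] = instruction

    def trace(self, new_target: str, new_instruction: str) -> str:
        """Render an explicit text trace to condition the next editing turn."""
        kept = [
            f"- keep: {t}: {i}"
            for t, i in self.constraints.items()
            if t != new_target
        ]
        return "\n".join(kept + [f"- apply now: {new_target}: {new_instruction}"])


intent = SessionIntent()
intent.update("sky", "make it a sunset")
intent.update("car", "paint it red")
print(intent.trace("car", "paint it blue"))
```

The point of the sketch is only that the editor is conditioned on a short, explicit list of what must be kept and what must change, rather than on the full interleaved history.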

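Technical framework: Edit-R2 comprises an intent reconstruction module, a generation module, and a trajectory filtering mechanism. The intent reconstruction module consolidates the session history; a unified objective jointly optimizes intent reconstruction in discrete text space and flow-matching image generation in continuous latent space.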

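Key innovation: Edit-R2 introduces multi-turn reinforcement learning over both reasoning and generation, together with a trajectory filtering mechanism that suppresses corrupted rollouts. This stabilizes training under state contamination and improves the stability and quality of multi-turn editing.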
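The summary does not state the paper's actual filtering criterion. As a hedged sketch of what trajectory filtering could look like, the hypothetical `filter_rollouts` function below drops any rollout in which some turn scores below a threshold. The per-turn scores and the `min_turn_score` value are assumptions, not details from the paper.

```python
from typing import List, Sequence


def filter_rollouts(
    rollouts: Sequence[Sequence[float]], min_turn_score: float = 0.5
) -> List[int]:
    """Return indices of rollouts in which every turn meets the threshold.

    Each rollout is a sequence of per-turn quality scores in [0, 1]. The
    scoring function and the threshold are assumptions for illustration.
    """
    kept = []
    for idx, turn_scores in enumerate(rollouts):
        # One bad early turn contaminates every later turn of the session,
        # so the whole trajectory is dropped rather than just that turn.
        if all(score >= min_turn_score for score in turn_scores):
            kept.append(idx)
    return kept


rollouts = [
    [0.9, 0.8, 0.7],  # clean session
    [0.9, 0.2, 0.8],  # turn 2 failed; turn 3 builds on a bad image
    [0.6, 0.6, 0.5],  # borderline but acceptable
]
print(filter_rollouts(rollouts))  # [0, 2]
```

Dropping the whole trajectory, rather than only the failed turn, reflects the state-contamination argument: later turns in that session were generated from an already corrupted image, so their rewards are unreliable training signal.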

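Key design: Edit-R2 trains with a jointly optimized objective whose loss accounts for how well intent reconstruction and image generation agree, on a network that combines multimodal feature extraction with generation.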
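One schematic way to read such a joint objective is as a weighted sum of a text-space term and a latent-space term. The summary does not give the actual formula, so the additive form and every symbol below are assumptions for illustration only.

```latex
\mathcal{J}(\theta)
  = \mathcal{J}_{\text{text}}(\theta)
  + \lambda \, \mathcal{J}_{\text{flow}}(\theta)
```

Here $\mathcal{J}_{\text{text}}$ stands for the RL objective over the discrete tokens of the reconstructed intent, $\mathcal{J}_{\text{flow}}$ for the RL objective over flow-matching image generation in continuous latent space, $\theta$ for the shared model parameters, and $\lambda$ for a hypothetical balancing weight.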

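📊 Experimental Highlights

On multi-turn in-context editing tasks, Edit-R2 delivers substantial gains and is competitive with strong baselines, as measured by the automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA), which supports its effectiveness.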

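🎯 Application Scenarios

Potential applications include image editing software, social media content creation, and real-time image processing in virtual reality. By improving multi-turn editing, Edit-R2 can give users a smoother and more intuitive editing experience, and it may have a lasting impact on the creative industries.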

📄 Abstract (Original)

Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.