JarvisArt: Liberating Human Artistic Creativity via an Intelligent Photo Retouching Agent

📄 arXiv: 2506.17612v1

Authors: Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, Shuicheng Yan

Categories: cs.CV

Published: 2025-06-21

Comments: 40 pages, 26 figures


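💡 One-sentence takeaway

Proposes JarvisArt to lower the barrier to entry of traditional photo retouching tools.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: intelligent retouching, multimodal models, user intent understanding, photo editing, deep learning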

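📋 Key points

  1. Existing photo retouching tools are powerful, but they demand professional knowledge and operating skill, which limits their adoption.
  2. JarvisArt is driven by a multimodal large language model that understands user intent and intelligently coordinates many retouching tools, improving the user experience.
  3. On the MMArt-Bench benchmark, JarvisArt outperforms GPT-4o with a 60% improvement in average pixel-level metrics for content fidelity.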

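📝 Abstract (summary)

Photo retouching has become integral to modern visual storytelling, letting users capture aesthetics and express creativity. Professional tools such as Adobe Lightroom are powerful but demand substantial expertise and manual effort. Existing AI solutions offer automation but often lack adjustability and generalize poorly, so they fail to meet diverse, personalized editing needs. This paper proposes JarvisArt, an agent driven by a multimodal large language model (MLLM) that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates more than 200 retouching tools in Lightroom. Trained in two stages, JarvisArt shows user-friendly interaction, strong generalization, and fine-grained control over both global and local adjustments.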

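🔬 Method details

Problem definition: The paper targets three pain points of traditional photo retouching tools: a high barrier to entry, little personalization, and limited flexibility. Existing AI solutions often fail to meet users' diverse editing needs.

Core idea: JarvisArt uses a multimodal large language model to understand user intent, mimic the reasoning process of professional artists, and intelligently coordinate many retouching tools, producing personalized retouching results.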
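As a rough illustration of the "coordinate retouching tools" step, the sketch below turns a model's structured output into a list of validated tool calls. Everything here is an assumption made for illustration: the JSON layout, the `ToolCall` type, the `parse_plan` helper, and the three slider names and ranges are hypothetical, and none of them is taken from the paper or from its Agent-to-Lightroom Protocol.

```python
import json
from dataclasses import dataclass

# Hypothetical slider names and value ranges, chosen only for this example.
SLIDER_RANGES = {
    "exposure": (-5.0, 5.0),
    "contrast": (-100.0, 100.0),
    "saturation": (-100.0, 100.0),
}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    value: float


def parse_plan(raw_json: str) -> list[ToolCall]:
    """Parse a JSON list of {"tool", "value"} items into clamped tool calls."""
    calls = []
    for item in json.loads(raw_json):
        tool = item["tool"]
        if tool not in SLIDER_RANGES:
            continue  # drop tools this sketch does not know about
        low, high = SLIDER_RANGES[tool]
        # Clamp the requested value into the allowed range for that slider.
        value = min(max(float(item["value"]), low), high)
        calls.append(ToolCall(tool, value))
    return calls


if __name__ == "__main__":
    plan = '[{"tool": "exposure", "value": 0.4}, {"tool": "contrast", "value": -20}]'
    for call in parse_plan(plan):
        print(call)
```

Validating and clamping the model's output before it reaches the editor is one simple way to keep a language-model agent from issuing out-of-range adjustments; the actual system may handle this differently.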

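Technical framework: JarvisArt is trained in two stages. The first is Chain-of-Thought supervised fine-tuning, which establishes basic reasoning and tool-use skills. The second is Group Relative Policy Optimization for Retouching (GRPO-R), which further improves decision-making and tool proficiency.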
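For readers unfamiliar with Group Relative Policy Optimization, the sketch below shows the group-relative advantage computation used in standard GRPO: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. This is only the generic normalization step. The retouching-specific rewards of GRPO-R are not described in this summary, so the reward values here are placeholders, and the function name, the `eps` term, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Placeholder rewards for four sampled responses to one prompt.
    print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Because the baseline is the group mean, no separate value network is needed: responses that score above their group's average get positive advantages and the rest get negative ones.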

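Key innovation: JarvisArt applies a multimodal large language model as a retouching agent that understands user intent and coordinates tools intelligently, which this summary presents as new among existing AI retouching tools. The original abstract additionally introduces the Agent-to-Lightroom Protocol for integration with Lightroom and the MMArt-Bench benchmark.

Key design: Training combines Chain-of-Thought fine-tuning with GRPO-R optimization, aiming for efficient and accurate decision-making and tool use.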

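📊 Experimental highlights

JarvisArt performs well on the MMArt-Bench benchmark: compared with GPT-4o, it achieves a 60% improvement in average pixel-level metrics for content fidelity while maintaining comparable instruction-following capability.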
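To make the phrase "60% improvement in pixel-level metrics" concrete, the sketch below shows one common way such a figure can be computed for an error-style metric where lower is better. The choice of mean absolute error, the helper names, and all numbers are illustrative assumptions; the summary does not state which pixel-level metrics the paper averages, and none of the values below come from the paper.

```python
def mean_absolute_error(pred: list[float], target: list[float]) -> float:
    """Average per-pixel absolute difference between two equally sized images."""
    if len(pred) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def relative_improvement(baseline_error: float, model_error: float) -> float:
    """Fractional error reduction relative to the baseline (lower error is better)."""
    return (baseline_error - model_error) / baseline_error


if __name__ == "__main__":
    # Made-up pixel intensities in [0, 1]; purely illustrative.
    target = [0.50, 0.60, 0.70, 0.80]
    baseline = [0.60, 0.70, 0.80, 0.90]
    model = [0.54, 0.64, 0.74, 0.84]
    base_err = mean_absolute_error(baseline, target)
    model_err = mean_absolute_error(model, target)
    print(f"relative improvement: {relative_improvement(base_err, model_err):.0%}")
```

Under this definition, a 60% improvement means the model's error is 40% of the baseline's error.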

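🎯 Application scenarios

Potential applications of JarvisArt include professional photography, social media content creation, and advertising design. Its retouching capability could improve users' productivity and encourage more creative expression.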

📄 Abstract (original)

Photo retouching has become integral to contemporary visual storytelling, enabling users to capture aesthetics and express creativity. While professional tools such as Adobe Lightroom offer powerful capabilities, they demand substantial expertise and manual effort. In contrast, existing AI-based solutions provide automation but often suffer from limited adjustability and poor generalization, failing to meet diverse and personalized editing needs. To bridge this gap, we introduce JarvisArt, a multi-modal large language model (MLLM)-driven agent that understands user intent, mimics the reasoning process of professional artists, and intelligently coordinates over 200 retouching tools within Lightroom. JarvisArt undergoes a two-stage training process: an initial Chain-of-Thought supervised fine-tuning to establish basic reasoning and tool-use skills, followed by Group Relative Policy Optimization for Retouching (GRPO-R) to further enhance its decision-making and tool proficiency. We also propose the Agent-to-Lightroom Protocol to facilitate seamless integration with Lightroom. To evaluate performance, we develop MMArt-Bench, a novel benchmark constructed from real-world user edits. JarvisArt demonstrates user-friendly interaction, superior generalization, and fine-grained control over both global and local adjustments, paving a new avenue for intelligent photo retouching. Notably, it outperforms GPT-4o with a 60% improvement in average pixel-level metrics on MMArt-Bench for content fidelity, while maintaining comparable instruction-following capabilities. Project Page: https://jarvisart.vercel.app/.