Open-Source Image Editing Models Are Zero-Shot Vision Learners

Authors: Wei Liu, Jiaxin Lin, Rui Chen

Categories: cs.CV, cs.CL

Published: 2026-05-06

💡 One-Sentence Summary

Evaluates the zero-shot vision abilities of open-source image-editing models.

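🎯 Matched Area: Pillar 3: Spatial Perception & Semantics

Keywords: open-source models, zero-shot learning, image editing, visual understanding, depth estimation, semantic segmentation, computer vision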

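📋 Key Points

Existing evidence relies mainly on closed-source models or task-specific fine-tuning, so it is unclear whether open-source image-editing models have zero-shot vision abilities.
The paper systematically evaluates three open-source image-editing models on dense visual prediction tasks without any fine-tuning.
The models show non-trivial zero-shot visual understanding; on NYUv2 surface normals, FireRed-Image-Edit (17.69° mean angular error) outperforms the fine-tuned Marigold (20.86°).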

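📝 Abstract (Translated from Chinese)

Recent studies show that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies mainly on closed-source models or requires task-specific instruction tuning, so it remains unclear whether publicly available image-editing models have zero-shot vision abilities. This paper systematically evaluates three open-source image-editing models (Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit) on dense visual prediction benchmarks without any fine-tuning. The results show that these models exhibit non-trivial zero-shot visual understanding: on NYUv2 surface normal estimation, FireRed-Image-Edit achieves a mean angular error of 17.69°, surpassing the fine-tuned Marigold (20.86°) and matching the instruction-tuned Vision Banana (17.78°). All code, evaluation scripts, and results are publicly released as a reproducible baseline for future work.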

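🔬 Method Details

Problem definition: The paper asks whether open-source image-editing models have zero-shot vision abilities on tasks they were never specifically trained for. Prior work relies mainly on closed-source models or task-specific fine-tuning, which limits how far its conclusions generalize.

Core idea: Systematically evaluate three open-source image-editing models on dense visual prediction tasks, and use the comparison to probe where zero-shot vision ability might come from.

Technical framework: The study has three stages. First, model selection: Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit. Second, task setup: monocular depth estimation, surface normal estimation, and semantic segmentation. Third, performance evaluation: comparing the models across these tasks.

Key contribution: A systematic evaluation (described as the first of its kind) of the zero-shot vision abilities of open-source image-editing models with no task-specific fine-tuning, which points to a potential benefit of image-editing pretraining.

Key design: The experiments use standard datasets such as NYUv2 and Cityscapes, and report standard metrics such as mean angular error and mIoU, so the results can be compared with prior work.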
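As a rough illustration of the two metrics named above, the following NumPy sketch computes mean angular error for surface normals and mIoU for segmentation. The function names, the `valid` mask, and the `ignore_index=255` convention are assumptions made for this example; the summary does not describe the authors' evaluation scripts.

```python
import numpy as np

def mean_angular_error_deg(pred, gt, valid=None, eps=1e-8):
    """Mean angle (degrees) between predicted and ground-truth normals of shape (..., 3)."""
    # Normalize both fields to unit length; eps only guards against zero vectors.
    pred = pred / np.maximum(np.linalg.norm(pred, axis=-1, keepdims=True), eps)
    gt = gt / np.maximum(np.linalg.norm(gt, axis=-1, keepdims=True), eps)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    if valid is not None:  # hypothetical boolean mask of pixels with ground truth
        angles = angles[valid]
    return float(angles.mean())

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes that occur in pred or gt."""
    keep = gt != ignore_index  # assumed convention for unlabeled pixels
    pred = pred[keep].astype(np.int64)
    gt = gt[keep].astype(np.int64)
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    return float((inter[present] / union[present]).mean())
```

Lower mean angular error is better, while higher mIoU is better. Predicted labels are assumed to lie in `[0, num_classes)`.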

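📊 Experimental Highlights

On NYUv2 surface normal estimation, FireRed-Image-Edit achieves a mean angular error of 17.69°, surpassing the fine-tuned Marigold (20.86°) and on par with Vision Banana (17.78°). On NYUv2 depth estimation, LongCat-Image-Edit reaches δ1 = 0.822, and Qwen-Image-Edit obtains δ1 = 0.868 on DIODE Indoor.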
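For readers unfamiliar with δ1, the sketch below shows one common way to compute it: the fraction of valid pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. The original abstract states that the NYUv2 figure uses affine alignment, so the example first fits a least-squares scale and shift. Whether the authors align in depth or disparity space is not stated, so aligning in depth space here is an assumption, as are the function names and the `mask` argument.

```python
import numpy as np

def affine_align(pred, gt, mask):
    """Fit scale s and shift t by least squares so that s * pred + t ≈ gt on valid pixels."""
    x = pred[mask].astype(np.float64)
    y = gt[mask].astype(np.float64)
    design = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, y, rcond=None)
    return s * pred + t

def delta1(pred, gt, mask, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below the threshold."""
    p = np.clip(pred[mask], 1e-6, None)  # avoid division by zero or negative depth
    g = gt[mask]
    ratio = np.maximum(p / g, g / p)
    return float((ratio < threshold).mean())

# Usage: a prediction that is correct up to scale and shift scores 1.0 after alignment.
gt_depth = np.array([[1.0, 2.0], [3.0, 4.0]])
raw_pred = (gt_depth - 0.5) / 2.0
valid = gt_depth > 0
aligned = affine_align(raw_pred, gt_depth, valid)
score = delta1(aligned, gt_depth, valid)
```

Because the alignment uses ground truth to fix scale and shift, δ1 under this protocol measures relative depth structure rather than metric accuracy.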

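🎯 Application Scenarios

Potential application areas include automated image editing, transfer learning for computer vision tasks, and the improvement of generative models. By showing that open-source models have zero-shot abilities, the study suggests a new route to vision tasks that may encourage wider practical adoption.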

📄 Abstract (Original)

Recent studies have shown that large generative models can solve vision tasks they were not explicitly trained for. However, existing evidence relies on closed-source models (Veo 3, Nano Banana Pro) or requires task-specific instruction tuning, leaving open whether publicly available image-editing models possess zero-shot vision abilities out of the box. We conduct a systematic evaluation of three open-source image-editing models (Qwen-Image-Edit, FireRed-Image-Edit, and LongCat-Image-Edit) on dense visual prediction tasks without any fine-tuning. We benchmark monocular depth estimation on NYUv2 and DIODE, surface normal estimation on NYUv2, and semantic segmentation on Cityscapes, covering both geometric and semantic scene understanding. Results show that open-source image-editing models exhibit non-trivial zero-shot visual understanding. On NYUv2 surface normals, FireRed-Image-Edit achieves a mean angular error of 17.69°, surpassing the fine-tuned Marigold (20.86°) and matching the instruction-tuned Vision Banana (17.78°) without any task-specific training. On NYUv2 depth estimation, LongCat-Image-Edit obtains δ1 = 0.822 with affine alignment, and Qwen-Image-Edit leads on DIODE Indoor (δ1 = 0.868). On Cityscapes semantic segmentation, Qwen-Image-Edit reaches 25.7 mIoU at the 19-class level and 49.5 mIoU at a coarser 7-category level. By comparing three independently trained editors, we test whether zero-shot vision ability is an emergent property of image-editing pretraining rather than a model-specific artifact. Code, evaluation scripts, and all results are publicly released to serve as a reproducible baseline for future work.
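The abstract reports Cityscapes results at both a 19-class and a coarser 7-category level. One plausible way to get the coarser labels is to merge the 19 training classes into the seven standard Cityscapes categories (flat, construction, object, nature, sky, human, vehicle). The sketch below assumes that standard grouping and an `ignore_index` of 255; the abstract does not say which grouping the authors actually used.

```python
import numpy as np

# Assumed mapping: Cityscapes train IDs 0..18 -> category index 0..6.
# Train ID order: road, sidewalk, building, wall, fence, pole, traffic light,
# traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus,
# train, motorcycle, bicycle.
TRAINID_TO_CATEGORY = np.array(
    [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6, 6, 6, 6], dtype=np.int64
)
CATEGORY_NAMES = ["flat", "construction", "object", "nature", "sky", "human", "vehicle"]

def to_categories(label_map, ignore_index=255):
    """Map an integer array of 19-class train IDs to 7 category indices."""
    out = np.full(label_map.shape, ignore_index, dtype=np.int64)
    valid = label_map != ignore_index  # leave unlabeled pixels untouched
    out[valid] = TRAINID_TO_CATEGORY[label_map[valid]]
    return out

# Usage: road and sidewalk merge into "flat"; car and bicycle merge into "vehicle".
example = to_categories(np.array([0, 1, 10, 13, 18, 255]))
```

Confusions within a category, such as car versus truck, stop counting as errors after the merge, which is one reason a 7-category score is typically higher than the 19-class score.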
