Mind the Gap: Aligning Vision Foundation Models to Image Feature Matching

Authors: Yuhan Liu, Jingwen Fu, Yang Wu, Kangyi Wu, Pengna Li, Jiayi Wu, Sanping Zhou, Jingmin Xin

Category: cs.CV

Published: 2025-07-14

Note: Accepted by ICCV 2025

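💡 One-Sentence Summary

Proposes the IMD framework, which aligns vision foundation models to image feature matching in order to address the multi-instance problem.

🎯 Matched Areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 6: Video Extraction and Matching (Video Extraction); Pillar 9: Embodied Foundation Models (Embodied Foundation Models)

Keywords: image feature matching, vision foundation models, diffusion models, cross-image interaction, multi-instance matching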

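📋 Key Points

Existing methods overlook the misalignment between single-image understanding and cross-image understanding when applying vision foundation models to image feature matching, which creates a performance bottleneck.
The IMD framework integrates a generative diffusion model to capture instance-level details and uses a cross-image interaction prompting module to promote information exchange between the two images of a pair.
Experiments show that IMD sets a new state of the art on commonly used benchmarks and achieves a substantial improvement on the multi-instance benchmark IMIM.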

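📝 Abstract (Translated Summary)

This paper proposes a framework called IMD (Image feature Matching with a pre-trained Diffusion model) to address the misalignment that arises when vision foundation models are applied to image feature matching. The misalignment stems from the fact that foundation models focus on single-image understanding, whereas feature matching requires cross-image understanding. Concretely, the embeddings extracted by commonly used foundation models differ from the optimal embeddings that feature matching requires, and there is no effective mechanism for turning single-image understanding into cross-image understanding. To address this, IMD integrates a generative diffusion model to capture instance-level details, and it builds on the prompt mechanism to introduce a cross-image interaction prompting module that promotes bidirectional information exchange between image pairs. The paper also proposes a new benchmark, IMIM, which focuses on multi-instance scenarios. Experimental results show that IMD sets a new state of the art on commonly used benchmarks and achieves a 12% improvement on IMIM, indicating that the method effectively mitigates the misalignment.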

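🔬 Method Details

Problem definition: The paper targets the performance loss in image feature matching that results from a mismatch between the objectives of vision foundation models and those of the matching task. Existing methods typically use foundation models trained with contrastive learning directly. These models emphasize global semantic information, neglect instance-level details, and lack an effective mechanism for cross-image information exchange, so they perform poorly in multi-instance scenarios in particular.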
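The multi-instance difficulty can be illustrated with a toy example. The sketch below is not from the paper: the descriptor layout, the `semantic` and `detail` vectors, and the `descriptors` helper are all hypothetical, chosen only to show why descriptors that encode class-level semantics alone cannot tell two instances of the same object apart, while descriptors that retain instance-level detail can.

```python
import numpy as np

def similarity(desc_a, desc_b):
    """Cosine similarity between every descriptor in A and every descriptor in B."""
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    return a @ b.T

# Hypothetical setup: two instances of the same object class appear in both images.
semantic = np.array([1.0, 0.0, 0.0, 0.0])   # class-level component shared by both instances
detail = np.array([[0.0, 1.0, 0.0, 0.0],    # instance-specific component of instance 0
                   [0.0, 0.0, 1.0, 0.0]])   # instance-specific component of instance 1

def descriptors(detail_weight):
    # One row per instance: class semantics plus weighted instance detail.
    return semantic + detail_weight * detail

# Semantics-only descriptors: every pairing gets the same score, so matching is ambiguous.
sim_global = similarity(descriptors(0.0), descriptors(0.0))

# Descriptors that keep instance detail: the correct pairing scores highest.
sim_instance = similarity(descriptors(1.0), descriptors(1.0))
matches = sim_instance.argmax(axis=1)

print(sim_global)
print(sim_instance)
print(matches)
```

With a detail weight of zero all four similarity scores are identical, so nearest-neighbor matching has no basis for choosing between the two instances. Once the instance-specific component is present, each instance is most similar to its own counterpart.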

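Core idea: The paper uses a generative diffusion model to compensate for the shortcomings of contrastive-learning models so that the features better suit image feature matching. A diffusion model can capture finer instance-level features, and its prompt mechanism provides a channel for cross-image information exchange, which mitigates the misalignment between single-image understanding and cross-image understanding.

Technical framework: The IMD framework consists of two core modules. 1) A diffusion-based feature extractor, which uses a pre-trained diffusion model to extract instance-level features from each image. 2) A cross-image interaction prompting module, which uses specially designed prompts to guide the diffusion model in exchanging information across images and thereby aligns the features of the image pair. In the overall pipeline, the diffusion model first extracts features from both images, and the cross-image interaction prompting module then performs feature fusion and matching.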
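The two-stage flow described above (extract features from both images, let the two feature sets interact in both directions, then match) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `extract_features` is a placeholder linear projection standing in for the pre-trained diffusion model, `cross_attend` uses plain cross-attention as a generic stand-in for the prompting module (whose internals this summary does not describe), and mutual nearest-neighbor matching is a common generic choice rather than something the paper is known to use.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extract_features(image, proj):
    """Placeholder for the diffusion feature extractor (hypothetical):
    flattens the image to per-pixel vectors and applies a fixed projection."""
    h, w, c = image.shape
    return image.reshape(h * w, c) @ proj

def cross_attend(query_feats, context_feats):
    """One direction of cross-image interaction (assumed form): each query
    feature attends to the other image's features, with a residual update."""
    d = query_feats.shape[1]
    attn = softmax(query_feats @ context_feats.T / np.sqrt(d), axis=1)
    return query_feats + attn @ context_feats

def mutual_nearest_neighbors(feats_a, feats_b):
    """Keep pairs (i, j) where j is i's best match in B and i is j's best match in A."""
    sim = feats_a @ feats_b.T
    nn_ab = sim.argmax(axis=1)
    nn_ba = sim.argmax(axis=0)
    idx_a = np.arange(len(feats_a))
    keep = nn_ba[nn_ab] == idx_a
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)

def match_pair(image_a, image_b, proj):
    feats_a = extract_features(image_a, proj)
    feats_b = extract_features(image_b, proj)
    # Bidirectional interaction: both updates read the pre-interaction features.
    fused_a = cross_attend(feats_a, feats_b)
    fused_b = cross_attend(feats_b, feats_a)
    return mutual_nearest_neighbors(fused_a, fused_b)

rng = np.random.default_rng(0)
proj = rng.normal(size=(3, 16))      # hypothetical projection weights
image_a = rng.random((4, 4, 3))      # random stand-ins for real images
image_b = rng.random((4, 4, 3))
matches = match_pair(image_a, image_b, proj)
print(matches.shape)
```

Each row of `matches` is a pair of flattened pixel indices, one in each image. The mutual check guarantees that no index appears in more than one pair, which is why it is a popular filter in matching pipelines.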

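Key innovations: 1) The paper introduces generative diffusion models to image feature matching and exploits their strong instance-level feature extraction. 2) It proposes a cross-image interaction prompting module that uses the prompt mechanism to enable bidirectional information exchange between image pairs, mitigating the misalignment between single-image understanding and cross-image understanding. 3) It proposes IMIM, a new multi-instance image matching benchmark that evaluates algorithm performance in complex scenes more accurately.

Key design: For the diffusion model, the paper may adopt a specific sampling strategy or loss function to optimize feature extraction. The implementation details of the cross-image interaction prompting module (for example, how the prompts are designed and how the interaction takes place) are critical. In addition, how the IMIM benchmark is constructed (for example, how images are selected and annotated) affects the reliability of the experimental results.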

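📊 Experimental Highlights

IMD achieves state-of-the-art performance on commonly used image feature matching benchmarks and a 12% improvement on IMIM, the purpose-built multi-instance image matching benchmark. This indicates that IMD effectively mitigates the misalignment that vision foundation models exhibit in image feature matching, and that it performs especially well in multi-instance scenarios.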

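🎯 Application Scenarios

The results can be applied to image retrieval, 3D reconstruction, visual localization, augmented reality, and related areas. More accurate and robust image feature matching improves the performance of these applications in complex scenes, for example by enabling more reliable matching under illumination changes, viewpoint changes, and occlusion.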

📄 Abstract (Original)

Leveraging the vision foundation models has emerged as a mainstream paradigm that improves the performance of image feature matching. However, previous works have ignored the misalignment when introducing the foundation models into feature matching. The misalignment arises from the discrepancy between the foundation models focusing on single-image understanding and the cross-image understanding requirement of feature matching. Specifically, 1) the embeddings derived from commonly used foundation models exhibit discrepancies with the optimal embeddings required for feature matching; 2) lacking an effective mechanism to leverage the single-image understanding ability into cross-image understanding. A significant consequence of the misalignment is they struggle when addressing multi-instance feature matching problems. To address this, we introduce a simple but effective framework, called IMD (Image feature Matching with a pre-trained Diffusion model) with two parts: 1) Unlike the dominant solutions employing contrastive-learning based foundation models that emphasize global semantics, we integrate the generative-based diffusion models to effectively capture instance-level details. 2) We leverage the prompt mechanism in generative model as a natural tunnel, propose a novel cross-image interaction prompting module to facilitate bidirectional information interaction between image pairs. To more accurately measure the misalignment, we propose a new benchmark called IMIM, which focuses on multi-instance scenarios. Our proposed IMD establishes a new state-of-the-art in commonly evaluated benchmarks, and the superior improvement 12% in IMIM indicates our method efficiently mitigates the misalignment.
