Reasoning to Attend: Try to Understand How $\texttt{<SEG>}$ Token Works
Authors: Rui Qian, Xin Yin, Dejing Dou
Category: cs.CV
Published: 2024-12-23 (updated: 2025-03-13)
Note: This work has been accepted to CVPR 2025; see https://github.com/rui-qian/READ
🔗 Code/Project: GITHUB
💡 One-sentence takeaway
Proposes the READ framework, which guides LMMs to attend to target regions via semantic similarity, improving their visual grounding ability.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: large multimodal models, visual grounding, semantic similarity, attention mechanism, catastrophic forgetting
📋 Key points
- Existing LMMs rely on the $\texttt{<SEG>}$ token for visual grounding, but how it works has received little study, which limits model optimization.
- READ uses similarity maps to guide the model's attention to the relevant regions of the image, improving the robustness of visual grounding.
- Experiments show that READ performs strongly on the ReasonSeg and RefCOCO(+/g) datasets and effectively avoids catastrophic forgetting.
🔬 Method details
Problem definition: Existing large multimodal models (LMMs) rely on the $\texttt{<SEG>}$ token for visual grounding, yet little research has examined how this token actually works, which limits further model optimization.
Core idea: Exploit the semantic similarity between the $\texttt{<SEG>}$ token and the image token embeddings, visualized as similarity maps, to reveal and guide where the model should attend.
Technical framework: READ consists of the following modules: 1) the LLaVA encoder and SAM decoder, which extract feature representations of the image and text; 2) a similarity-map generation module, which computes the semantic similarity between the $\texttt{<SEG>}$ token and the image token embeddings; 3) the Similarity as Points (SasP) module, which converts highly activated points of the similarity map into attention guidance.
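As a rough sketch of the similarity-map step (illustrative names and shapes only, not the authors' actual code), the map can be obtained as the cosine similarity between the $\texttt{<SEG>}$ embedding and each image token embedding, reshaped to the patch grid:

```python
import numpy as np

def similarity_map(seg_embedding, image_tokens, grid_hw):
    """Cosine similarity between one <SEG> embedding (D,) and N image
    token embeddings (N, D), reshaped to a spatial (H, W) map.
    Hypothetical helper; the paper's implementation may differ."""
    h, w = grid_hw
    seg = seg_embedding / (np.linalg.norm(seg_embedding) + 1e-8)
    toks = image_tokens / (np.linalg.norm(image_tokens, axis=1, keepdims=True) + 1e-8)
    sim = toks @ seg                  # (N,) cosine similarities
    return sim.reshape(h, w)          # spatial similarity map

# Toy usage: a 4x4 grid of 8-dim tokens; patch 5 is made maximally similar.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
smap = similarity_map(tokens[5], tokens, (4, 4))
r, c = np.unravel_index(smap.argmax(), smap.shape)
print(int(r), int(c))  # -> 1 1  (the peak sits at flat patch index 5)
```

Highly activated cells of such a map are exactly what READ borrows as guidance for where to attend.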
Key innovation: the Similarity as Points (SasP) module, which effectively exploits the information in similarity maps to guide the model's attention mechanism. Compared with existing methods, SasP requires no extra training or complex network structures and can be applied to existing LMM frameworks in a plug-and-play fashion. The paper also analyzes in depth how the $\texttt{<SEG>}$ token works, showing that it extensively queries individual tokenized image patches to match object semantics from the text to the paired image.
Key design: the crux of SasP is how to turn the highly activated points of the similarity map into effective attention-guidance signals. Concretely, the module first normalizes the similarity map and then selects the top-k highest-activation points as key points. These key points are used to adjust the attention weights over the image tokens, so that the model attends more to the regions that semantically match the $\texttt{<SEG>}$ token.
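The normalize-then-select step can be sketched as follows (a minimal illustration under assumed names, not the authors' code; in READ the selected points then guide attention over image tokens):

```python
import numpy as np

def topk_points(sim_map, k=3):
    """Min-max normalize a similarity map and return the (row, col)
    coordinates of the k highest activations, highest first.
    Illustrative SasP-style selection, not the paper's implementation."""
    m = (sim_map - sim_map.min()) / (sim_map.max() - sim_map.min() + 1e-8)
    flat_idx = np.argsort(m, axis=None)[::-1][:k]   # flat indices, descending
    return [tuple(int(v) for v in np.unravel_index(i, m.shape))
            for i in flat_idx]

smap = np.array([[0.1, 0.9, 0.2],
                 [0.8, 0.3, 0.7],
                 [0.0, 0.4, 0.5]])
print(topk_points(smap, k=3))  # -> [(0, 1), (1, 0), (1, 2)]
```

Note that min-max normalization is monotonic, so it preserves the ranking of activations; the normalization matters when the raw values are later reused as weights.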
🖼️ Key figures
📊 Experimental highlights
Experimental results show that READ achieves significant performance gains on the ReasonSeg and RefCOCO(+/g) datasets. Moreover, results on the augmented FP-RefCOCO(+/g) dataset show that the framework effectively avoids catastrophic forgetting and preserves the model's generalization ability. Specific numbers are not given in the abstract and must be looked up in the paper.
🎯 Application scenarios
The results can be applied to intelligent visual question answering, image editing, robot navigation, and related areas. By improving LMMs' visual grounding ability, the method enables more precise human-machine interaction and smarter automation. In robot navigation, for example, it could guide a robot to attend to target objects for more accurate path planning and object grasping.
📄 Abstract (original)
Current Large Multimodal Models (LMMs) empowered visual grounding typically rely on $\texttt{<SEG>}$ tokens as a text prompt to jointly optimize the vision-language model (e.g., LLaVA) and the downstream task-specific model (e.g., SAM). However, we observe that little research has looked into how it works. In this work, we first visualize the similarity maps, which are obtained by computing the semantic similarity between the $\texttt{<SEG>}$ token and the image token embeddings derived from the last hidden layer in both the LLaVA encoder and SAM decoder. Intriguingly, we have found that a striking consistency holds in terms of activation responses in the similarity map, which reveals that what the $\texttt{<SEG>}$ token contributes to is semantic similarity within image-text pairs. Specifically, the $\texttt{<SEG>}$ token, a placeholder expanded in text vocabulary, extensively queries among individual tokenized image patches to match the semantics of an object from text to the paired image, while the Large Language Models (LLMs) are being fine-tuned. Upon the above findings, we present READ, which facilitates LMMs' resilient $\textbf{REA}$soning capability of where to atten$\textbf{D}$ under the guidance of highly activated points borrowed from similarity maps. Remarkably, READ features an intuitive design, Similarity as Points module (SasP), which can be seamlessly applied to $\texttt{<SEG>}$-like paradigms in a plug-and-play fashion. Also, extensive experiments have been conducted on ReasonSeg and RefCOCO(+/g) datasets. To validate whether READ suffers from catastrophic forgetting of previous skills after fine-tuning, we further assess its generation ability on an augmented FP-RefCOCO(+/g) dataset. All codes and models are publicly available at https://github.com/rui-qian/READ.