GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
Authors: Brown Ebouky, Gabriele Carrino, Niccolo Avogaro, Christoph Studer, Andrea Bartezzaghi, Mattia Rigotti
Categories: cs.CV, cs.AI, cs.CL
Published: 2026-05-08
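💡 One-Sentence Summary
Proposes the GazeVLM architecture, which performs active visual reasoning through internal attention control to address the limitations of passive processing in VLMs.
🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: active vision, multimodal reasoning, attention mechanism, vision-language models, spatial selective attention, policy optimization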
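📋 Key Points
- Existing VLMs passively process large numbers of visual tokens, which dilutes spatial reasoning, induces hallucinations, and makes fine detail in high-resolution images hard to handle.
- GazeVLM introduces self-generated gaze tokens that, under metacognitive control, dynamically adjust the causal attention mask to implement spatial selective attention.
- Experiments on HRBench-4k/8k show that the model surpasses state-of-the-art models in its parameter class by nearly 4% and also outperforms agentic pipelines built around thinking with images.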
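📝 Abstract (translated from Chinese)
Human visual reasoning relies on active vision: metacognitive control drives top-down, goal-directed attention that dynamically focuses on task-relevant details while maintaining global awareness. Modern vision-language models (VLMs), by contrast, typically process visual information passively, relying on the static accumulation of large token contexts, which dilutes spatial reasoning and tends to induce linguistic hallucinations. To address this, the paper proposes GazeVLM, a multimodal architecture that internalizes metacognitive oversight directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens, GazeVLM establishes a top-down control mechanism over its own causal attention mask.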
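🔬 Method Details
Problem definition: When processing high-resolution images, existing VLMs are limited by a static context window and cannot focus effectively on local details, so spatial reasoning degrades and hallucinations become more likely. Existing remedies, such as image cropping or adding more visual tokens, tend to introduce extra computational overhead or break global semantic consistency.
Core idea: Borrowing from human active vision, the method brings metacognitive control into the model's reasoning loop. "Gaze tokens" generated by the model itself dynamically modulate how attention is allocated, suppressing or focusing on visual features in real time, so that efficient local reasoning is possible without external tools.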
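Technical framework: During inference, GazeVLM introduces gaze tokens that the model generates autonomously. Generating one triggers a continuous suppression bias on the causal attention mask that dampens irrelevant visual features; once local reasoning concludes, the bias lifts and the global view is restored.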
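Key innovation: The attention-control mechanism is internalized as part of the model's own reasoning rather than delegated to an external agent. This "internal attention control" lets the model switch between global awareness and local focus while keeping its spatial reasoning coherent.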
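The suppression-and-release behaviour described above can be illustrated with a toy, single-query attention computation. This is only a sketch under stated assumptions, not the paper's implementation: the source does not specify the form of the bias, so the additive penalty on attention logits, the function name `gaze_biased_attention`, and the parameters `in_focus` and `suppression` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaze_biased_attention(logits, in_focus, suppression=4.0, gaze_active=True):
    """Attention weights of one query over a row of visual tokens.

    logits      -- raw attention scores, one per visual token
    in_focus    -- booleans; True marks tokens inside the gazed region
                   (hypothetical representation of the focal intent)
    suppression -- hypothetical bias strength subtracted from the logits
                   of out-of-focus tokens while a gaze is active
    gaze_active -- False models the state after the bias has lifted
    """
    if len(logits) != len(in_focus):
        raise ValueError("logits and in_focus must have the same length")
    if gaze_active:
        # Soft suppression: out-of-focus tokens are dampened, not removed,
        # so some peripheral (global) signal remains.
        biased = [x if keep else x - suppression
                  for x, keep in zip(logits, in_focus)]
    else:
        biased = list(logits)
    return softmax(biased)


logits = [1.0, 1.0, 1.0, 1.0]
focus = [True, True, False, False]
global_view = gaze_biased_attention(logits, focus, gaze_active=False)
focal_view = gaze_biased_attention(logits, focus, gaze_active=True)
print("global:", global_view)
print("focal: ", focal_view)
```

With equal logits, the global view spreads attention uniformly, while the focal view shifts most of the mass onto the two in-focus tokens and leaves a small, non-zero weight on the rest. A finite `suppression` value was chosen here (rather than a hard mask) only to mirror the "dampens" wording in the source.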
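Key design: The model is trained with a customized Group Relative Policy Optimization (GRPO) procedure whose reward reinforces valid grounding, that is, correct localization of task-relevant regions. This keeps gaze behaviour aligned with the reasoning task and is intended to improve robustness in high-resolution settings.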
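As a rough illustration of the group-relative part of GRPO, the sketch below scores a group of rollouts for the same prompt and normalizes each reward against the group mean and standard deviation. The reward composition is an assumption made for illustration: the source only says that the procedure rewards valid grounding, so the `reward` function, its `grounding_weight` parameter, and the use of the population standard deviation are hypothetical choices rather than the authors' recipe.

```python
import statistics


def reward(answer_correct, grounding_valid, grounding_weight=0.5):
    """Hypothetical scalar reward: task correctness plus a grounding bonus."""
    return float(answer_correct) + grounding_weight * float(grounding_valid)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled rollouts for one prompt: (answer_correct, grounding_valid).
rollouts = [(True, True), (True, False), (False, True), (False, False)]
rewards = [reward(a, g) for a, g in rollouts]
advantages = group_relative_advantages(rewards)

for (a, g), r, adv in zip(rollouts, rewards, advantages):
    print(f"correct={a!s:5} grounded={g!s:5} reward={r:.1f} advantage={adv:+.3f}")
```

Under these assumptions, a rollout that is both correct and validly grounded receives the largest positive advantage, and a correct but ungrounded rollout is ranked below it, which is the kind of pressure a grounding-aware reward is meant to create.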
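📊 Experimental Highlights
GazeVLM delivers strong performance at the 4B-parameter scale. On the HRBench-4k and HRBench-8k benchmarks, its reasoning accuracy exceeds that of state-of-the-art VLMs in its parameter class by nearly 4%, and it outperforms agentic multimodal pipelines built around thinking with images, such as those that rely on external cropping tools, by more than 5%. These results indicate that internal attention control is effective for high-resolution visual tasks.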
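🎯 Application Scenarios
GazeVLM is suited to scenarios that require precise visual grounding and complex reasoning, such as medical image diagnosis, industrial precision inspection, perception for autonomous driving, and complex document analysis. Because it needs no external tools, it may be attractive for deployment on resource-constrained edge devices, where it could improve the efficiency and accuracy of reasoning over high-resolution inputs.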
📄 Abstract (Original)
Human visual reasoning is governed by active vision, a process where metacognitive control drives top-down goal-directed attention, dynamically routing foveal focus toward task-relevant details while maintaining peripheral awareness of the global scene. In contrast, modern Vision-Language Models (VLMs) process visual information passively, relying on the static accumulation of massive token contexts that dilute spatial reasoning and induce linguistic hallucinations. Here we propose the following paradigm shift: GazeVLM, a multimodal architecture that internalizes this metacognitive oversight over its deployment of attention resources directly into the reasoning loop. By empowering the VLM to autonomously generate gaze tokens, GazeVLM establishes a top-down control mechanism over its own causal attention mask. The model dynamically dictates its focal intent, triggering a continuous suppression bias that dampens irrelevant visual features, implementing spatial selective attention and simulating foveal fixation. Once local reasoning concludes, the bias lifts, seamlessly restoring the global view. This architecture enables the model to fluidly transition between global spatial awareness and localized focal reasoning without relying on external agentic contraptions like cropping tools, or inflating the context window with additional visual tokens derived from localized visual patches. Trained with a bespoke Group Relative Policy Optimization (GRPO) procedure that rewards valid grounding, our 4B-parameter GazeVLM delivers strong high-resolution multimodal reasoning performance, surpassing state-of-the-art VLMs in its parameter class by nearly 4% and agentic multimodal pipelines built around thinking with images by more than 5% on HRBench-4k and HRBench-8k.