CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
Authors: Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Categories: cs.CV, cs.AI, cs.CL
Published: 2025-04-21 (updated: 2025-08-13)
Comments: ICCV 2025
🔗 Code/Project: GITHUB
💡 One-Sentence Takeaway
Introduces CAPTURe, a benchmark for evaluating vision-language models' spatial reasoning in occluded scenes.
🎯 Matched pillars: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 7: Motion Retargeting
Keywords: vision-language models, spatial reasoning, occlusion handling, object counting, pattern recognition
📋 Key Points
- Existing vision-language models struggle to understand spatial relationships in real-world scenes: when objects are partially or fully occluded, they fail to reason accurately about space.
- The CAPTURe task evaluates spatial reasoning and pattern recognition under occlusion by requiring models to count objects while inferring how a pattern continues behind an occluder.
- Experiments show that even strong vision-language models perform poorly on CAPTURe, especially when occlusion is present, highlighting their deficiency in spatial reasoning.
📝 Abstract (Translated)
This paper introduces a new task, CAPTURe (Counting Amodally for Patterns Through Unseen REgions), designed to evaluate the spatial reasoning ability of vision-language models (VLMs) under occlusion. The task requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder. CAPTURe has two parts: CAPTURe-real, with manually filtered images of real objects in patterns, and CAPTURe-synthetic, with generated patterned images for controlled diagnostics. The study evaluates four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe and finds that models struggle to count both occluded and unoccluded patterns. More importantly, occlusion significantly degrades model performance, indicating that VLMs are deficient at inferring unseen spatial relationships; even the strongest VLMs, such as GPT-4o, fail to count under occlusion. In contrast, humans achieve very low error on CAPTURe. Providing auxiliary information about occluded object locations improves model performance, suggesting that model error stems both from an inability to handle occlusion and from difficulty counting objects in images.
🔬 Method Details
Problem definition: The paper targets the inability of vision-language models (VLMs) to understand and reason about spatial patterns formed by occluded objects. Existing methods cannot accurately infer information about occluded regions, which are common in real-world scenes, leading to degraded spatial understanding.
Core idea: The paper builds a task, CAPTURe, dedicated to evaluating VLMs' spatial reasoning under occlusion. The task requires a model to infer the number of objects in the occluded region from the visible arrangement pattern, testing whether the model can "fill in" missing information.
Technical framework: CAPTURe comprises two datasets: CAPTURe-real, containing images of real occluded objects, and CAPTURe-synthetic, containing procedurally generated images with explicit patterns for controlled diagnostic analysis. The model receives an image and a question (e.g., "How many objects are there?") and outputs an answer.
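This image-plus-question protocol can be pictured with a short sketch. The `query_vlm` callable and the prompt wording below are hypothetical placeholders standing in for whichever model is under test, not the paper's actual harness:

```python
import re

def parse_count(answer):
    """Extract the first integer from a free-form model answer, or None."""
    m = re.search(r"\d+", answer.replace(",", ""))
    return int(m.group()) if m else None

def evaluate(examples, query_vlm):
    """Exact-match counting accuracy over (image, ground-truth count) pairs.

    `query_vlm(image, prompt)` is a hypothetical callable wrapping the
    VLM under test; it returns the model's text answer.
    """
    prompt = "How many objects are there in the pattern? Answer with a number."
    correct = 0
    for image, true_count in examples:
        pred = parse_count(query_vlm(image, prompt))
        correct += int(pred == true_count)
    return correct / len(examples)
```

Because answers arrive as free-form text, some answer-parsing step like `parse_count` is needed before counting accuracy can be scored.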
Key innovation: Unlike prior vision-language tasks, CAPTURe focuses specifically on evaluating VLMs' spatial reasoning under occlusion, probing not only visual recognition but also the understanding of, and reasoning about, spatial relationships.
Key design: The CAPTURe-synthetic dataset lets researchers control factors such as the degree of occlusion and the object arrangement pattern, enabling more precise diagnosis of VLM failures. The main evaluation metric is counting accuracy, i.e., the deviation of the model's predicted object count from the ground truth.
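To make the controlled setup concrete, here is a minimal sketch of one plausible generation scheme (the grid layout, rectangular occluder, and all parameter names are illustrative assumptions, not the paper's generation code). The ground-truth count covers all objects, visible and hidden alike:

```python
from dataclasses import dataclass

@dataclass
class SyntheticExample:
    points: list       # (x, y) positions of every object in the pattern
    occluded: list     # subset of points hidden by the occluder
    total_count: int   # ground truth: visible + occluded objects

def make_grid_example(rows, cols, spacing, occluder):
    """Build a rows x cols grid of objects and mark those covered by an
    axis-aligned rectangular occluder (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = occluder
    points = [(c * spacing, r * spacing)
              for r in range(rows) for c in range(cols)]
    occluded = [(x, y) for x, y in points
                if x0 <= x <= x1 and y0 <= y <= y1]
    return SyntheticExample(points, occluded, len(points))
```

Varying `rows`, `cols`, `spacing`, and the occluder rectangle is what makes the diagnostic controllable: the occluded fraction can be swept while the ground-truth count stays known by construction.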
📊 Experimental Highlights
Even state-of-the-art VLMs such as GPT-4o perform poorly on CAPTURe, especially under occlusion. On both CAPTURe-real and CAPTURe-synthetic, model counting accuracy falls well below human level. Providing auxiliary information about occluded object locations significantly improves performance, indicating that model error stems both from an inability to handle occlusion and from difficulty counting objects in images.
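One way to picture the auxiliary-information condition is to append the occluded objects' coordinates to the counting question. The exact prompt wording below is an assumption for illustration; the paper's prompts may differ:

```python
def build_prompt(occluded_coords=None):
    """Build a counting prompt, optionally augmented with the (x, y)
    locations of objects hidden behind the occluder."""
    prompt = "How many objects are there in the pattern? Answer with a number."
    if occluded_coords:
        coords = "; ".join(f"({x}, {y})" for x, y in occluded_coords)
        prompt += (" Note: additional objects hidden behind the occluder"
                   f" are located at: {coords}.")
    return prompt
```

Comparing accuracy with and without this extra text is what lets the study separate occlusion-handling failures from plain counting failures.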
🎯 Application Scenarios
These findings are relevant to robot navigation, autonomous driving, intelligent surveillance, and related areas. In autonomous driving, for example, a vehicle must recognize pedestrians or obstacles occluded by other vehicles or objects. Improving VLMs' spatial reasoning under occlusion can make such systems safer and more reliable and enable richer scene understanding.
📄 Abstract (Original)
Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information of occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion as well as difficulty in counting in images. Code and data: https://github.com/atinpothiraj/CAPTURe