Boosting Gaze Object Prediction via Pixel-level Supervision from Vision Foundation Model

Authors: Yang Jin, Lei Zhang, Shi Yan, Bin Fan, Binglu Wang

Category: cs.CV

Published: 2024-08-02

Note: Accepted by ECCV 2024

💡 One-sentence summary

Improving gaze object prediction with pixel-level supervision from a vision foundation model

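🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: gaze object prediction, vision foundation model, pixel-level supervision, semantic segmentation, human-computer interaction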

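📋 Key points

Existing gaze object prediction methods rely on box-level annotations, which are prone to semantic ambiguity and limit prediction accuracy.
The paper uses pixel-level supervision from a vision foundation model to improve the accuracy of gaze object segmentation.
The method performs well on the GOO-Synth and GOO-Real datasets, which supports its effectiveness.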

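📝 Abstract (translated)

Gaze object prediction (GOP) aims to predict the category and location of the object a person is looking at. Previous methods use box-level supervision to identify the gaze object but suffer from semantic ambiguity: a single box may contain several objects. Vision foundation models (VFMs) have improved object segmentation from box prompts and can localize objects more precisely, which reduces confusion and benefits fine-grained prediction of gaze objects. This paper proposes the more challenging gaze object segmentation (GOS) task: inferring the pixel-level mask that corresponds to the person's gaze behavior. In particular, we propose integrating the pixel-level supervision provided by a VFM into gaze object prediction to mitigate semantic ambiguity. The result is our gaze object detection and segmentation framework, which is capable of accurate pixel-level prediction. Unlike previous methods that either require an additional head input or ignore head features, we obtain head features automatically from scene features, which keeps inference efficient and flexible in real-world use. Moreover, existing methods fuse features directly to predict the gaze heatmap, which may overlook the object's spatial location and subtle details; we instead develop a space-to-object gaze regression method to facilitate human-object gaze interaction. Extensive experiments on the GOO-Synth and GOO-Real datasets demonstrate the effectiveness of our method.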

🔬 Method details

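Problem definition: Existing gaze object prediction methods rely mainly on box-level annotations, which are inherently semantically ambiguous. A single box may contain several closely adjacent objects, making it hard for the model to determine which one the person is actually looking at. In addition, some methods require an extra head input while others ignore head features, and both choices limit practical use.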
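
The ambiguity described above can be made concrete with a toy example. The sketch below is an illustration only and is not taken from the paper: the 6x6 label map and the helper names `bounding_box`, `box_purity`, and `other_object_pixels` are hypothetical. It measures how much of an object's tight bounding box is actually covered by that object.

```python
# Toy 6x6 label map (hypothetical): 0 = background, 1 = object A, 2 = object B.
LABELS = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 1, 2, 2, 2, 0],
    [0, 1, 1, 2, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def bounding_box(labels, obj_id):
    """Return the tight box (row_min, col_min, row_max, col_max), inclusive."""
    cells = [(r, c) for r, row in enumerate(labels)
             for c, value in enumerate(row) if value == obj_id]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows), max(cols)


def _labels_in_box(labels, obj_id):
    """List the label of every pixel inside the object's tight box."""
    r0, c0, r1, c1 = bounding_box(labels, obj_id)
    return [labels[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def box_purity(labels, obj_id):
    """Fraction of pixels inside the object's box that belong to the object."""
    inside = _labels_in_box(labels, obj_id)
    return sum(1 for value in inside if value == obj_id) / len(inside)


def other_object_pixels(labels, obj_id):
    """Count pixels inside the object's box that belong to a different object."""
    inside = _labels_in_box(labels, obj_id)
    return sum(1 for value in inside if value not in (0, obj_id))
```

In this layout object A fills only 7 of the 12 pixels in its own box, and 4 of the remaining pixels belong to object B, so a box-level label for A also covers part of B. A pixel-level mask of A has no such overlap.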

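Core idea: Use the pixel-level supervision provided by a vision foundation model (VFM) to localize the gaze object more precisely. VFMs perform well at object segmentation and can generate high-quality pixel-level masks from box prompts, which alleviates semantic ambiguity. The paper also extracts head features automatically from scene features, removing the dependence on additional head input and making the model more practical.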
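
As a rough sketch of how box prompts could be turned into pixel-level supervision, the code below maps each annotated box to a pseudo-label mask through a pluggable segmenter. Everything here is an assumption for illustration: the summary does not name the VFM or its interface, so `toy_box_prompted_segmenter` is a trivial brightness threshold standing in for a real model, and `build_mask_supervision` is a hypothetical helper.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (row_min, col_min, row_max, col_max), inclusive
Image = List[List[float]]         # grayscale intensities
Mask = List[List[int]]            # 1 = object pixel, 0 = everything else


def toy_box_prompted_segmenter(image: Image, box: Box) -> Mask:
    """Toy stand-in for a VFM (NOT a real model): keeps the pixels inside
    the box that are brighter than the mean intensity of the box."""
    r0, c0, r1, c1 = box
    values = [image[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]
    threshold = sum(values) / len(values)
    mask = [[0] * len(row) for row in image]
    for r in range(r0, r1 + 1):
        for c in range(c0, c1 + 1):
            if image[r][c] > threshold:
                mask[r][c] = 1
    return mask


def build_mask_supervision(image: Image, boxes: List[Box],
                           segmenter: Callable[[Image, Box], Mask]) -> List[Mask]:
    """Turn each annotated box into a pixel-level pseudo-label mask."""
    return [segmenter(image, box) for box in boxes]


# A bright L-shaped object; the box (1, 1, 2, 2) also contains one darker pixel.
image = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.9, 0.1],
    [0.1, 0.9, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]
masks = build_mask_supervision(image, [(1, 1, 2, 2)], toy_box_prompted_segmenter)
```

Passing the segmenter as a callable keeps the sketch independent of any particular model; a real box-promptable VFM would replace the toy function.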

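Technical framework: The framework has two branches, gaze object detection and segmentation. First, head features are extracted automatically from scene features. Next, the space-to-object gaze regression method builds an initial spatial connection between the person and the objects. This connection is then refined using the semantically clear features from the segmentation branch. Finally, a gaze heatmap is predicted for precise localization of the gaze object.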
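
The last step, going from a gaze heatmap to a specific object, can be sketched as follows. This summary does not say how the heatmap and the object masks are combined, so the rule used here (pick the object whose mask has the highest mean heatmap value) is an assumption, and `gaussian_heatmap`, `mean_energy`, and `select_gaze_object` are hypothetical names. The Gaussian heatmap merely stands in for a network prediction.

```python
import math


def gaussian_heatmap(height, width, center, sigma=1.0):
    """Synthetic heatmap peaked at `center` (row, col); stands in for a prediction."""
    cr, cc = center
    return [[math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
             for c in range(width)] for r in range(height)]


def mean_energy(heatmap, mask):
    """Average heatmap value over the pixels where the mask is 1."""
    total, count = 0.0, 0
    for heat_row, mask_row in zip(heatmap, mask):
        for heat, inside in zip(heat_row, mask_row):
            if inside:
                total += heat
                count += 1
    return total / count if count else 0.0


def select_gaze_object(heatmap, masks):
    """Return the name of the object whose mask collects the most heat on average.

    `masks` maps an object name to its binary mask.
    """
    return max(masks, key=lambda name: mean_energy(heatmap, masks[name]))


def mask_from_cells(height, width, cells):
    """Build a binary mask with 1s at the given (row, col) cells."""
    mask = [[0] * width for _ in range(height)]
    for r, c in cells:
        mask[r][c] = 1
    return mask


heatmap = gaussian_heatmap(5, 5, center=(1, 1))
masks = {
    "cup": mask_from_cells(5, 5, [(1, 1), (1, 2)]),
    "box": mask_from_cells(5, 5, [(3, 3), (3, 4)]),
}
chosen = select_gaze_object(heatmap, masks)
```

Because the heatmap peaks at (1, 1), the object named "cup" is selected. Scoring over masks rather than boxes means pixels of a neighboring object do not contribute to an object's score.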

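Key innovations: (1) pixel-level supervision from a vision foundation model, which addresses the semantic ambiguity of box-level annotations; (2) automatic extraction of head features, with no additional head input required; (3) a space-to-object gaze regression method that better captures gaze interaction between the person and objects.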

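Key design: Space-to-object gaze regression is one of the key designs. It first constructs an initial human-object spatial connection and then refines it by interacting with features from the segmentation branch. The loss design also matters: it has to balance the detection and segmentation branches while keeping the predicted gaze heatmap accurate. Network details (e.g., number of convolutional layers, channel widths) and training hyperparameters (e.g., learning rate, batch size) have to be tuned experimentally.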
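
One common way to balance several objectives is a weighted sum of per-branch losses. The sketch below shows that pattern only; the summary does not give the paper's actual loss terms or weights, so the mean-squared-error heatmap loss, the function names, and the weights `w_det`, `w_seg`, and `w_gaze` are all assumptions.

```python
def heatmap_mse(pred, target):
    """Mean squared error between two heatmaps of the same shape."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def total_loss(det_loss, seg_loss, gaze_loss, w_det=1.0, w_seg=1.0, w_gaze=1.0):
    """Weighted sum of detection, segmentation, and gaze heatmap losses.

    The weights are hypothetical hyperparameters that would be tuned on a
    validation set.
    """
    return w_det * det_loss + w_seg * seg_loss + w_gaze * gaze_loss


gaze_term = heatmap_mse([[0.0, 1.0]], [[0.0, 0.0]])
combined = total_loss(1.0, 2.0, 3.0, w_seg=0.5, w_gaze=2.0)
```

Raising one weight makes the optimizer favor that branch, which is the kind of trade-off the paragraph above refers to.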

📊 Experimental highlights

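The method was evaluated extensively on the GOO-Synth and GOO-Real datasets, and the results indicate a significant gain in gaze object prediction accuracy. It shows a clear improvement over baseline methods on both datasets, supporting the effectiveness of pixel-level supervision and space-to-object gaze regression. Specific figures (e.g., mAP, IoU) are not reproduced here; see the paper.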
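
For readers unfamiliar with the IoU metric mentioned above, here is a minimal sketch of intersection over union for two binary masks. It shows the standard definition only; the exact evaluation protocol used in the paper is not described in this summary, and returning 0.0 for two empty masks is a convention chosen here.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of the same shape."""
    intersection, union = 0, 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                intersection += 1
            if a or b:
                union += 1
    # Convention chosen for this sketch: two empty masks score 0.0.
    return intersection / union if union else 0.0


predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
score = mask_iou(predicted, ground_truth)
```

The two masks share one pixel and together cover three, so the score is 1/3.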

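🎯 Application scenarios

The results can be applied to human-computer interaction, eye-tracking-assisted driving, ad recommendation, and user behavior analysis. Accurately predicting what a user is looking at enables more natural and intelligent interaction and a better user experience. In assisted driving, it can help the system understand the driver's intent and improve safety. In ad recommendation, it allows ads to be matched to the user's gaze behavior.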

📄 Abstract (original)

Gaze object prediction (GOP) aims to predict the category and location of the object that a human is looking at. Previous methods utilized box-level supervision to identify the object that a person is looking at, but struggled with semantic ambiguity, i.e., a single box may contain several items since objects are close together. The Vision foundation model (VFM) has improved in object segmentation using box prompts, which can reduce confusion by more precisely locating objects, offering advantages for fine-grained prediction of gaze objects. This paper presents a more challenging gaze object segmentation (GOS) task, which involves inferring the pixel-level mask corresponding to the object captured by human gaze behavior. In particular, we propose that the pixel-level supervision provided by VFM can be integrated into gaze object prediction to mitigate semantic ambiguity. This leads to our gaze object detection and segmentation framework that enables accurate pixel-level predictions. Different from previous methods that require additional head input or ignore head features, we propose to automatically obtain head features from scene features to ensure the model's inference efficiency and flexibility in the real world. Moreover, rather than directly fuse features to predict gaze heatmap as in existing methods, which may overlook spatial location and subtle details of the object, we develop a space-to-object gaze regression method to facilitate human-object gaze interaction. Specifically, it first constructs an initial human-object spatial connection, then refines this connection by interacting with semantically clear features in the segmentation branch, ultimately predicting a gaze heatmap for precise localization. Extensive experiments on GOO-Synth and GOO-Real datasets demonstrate the effectiveness of our method.
