From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Grounded Open-vocabulary Situation Recognition

Authors: Chen Cai, Tianyi Liu, Jianjun Gao, Wenyang Liu, Kejun Wu, Ruoyu Wang, Yi Wang, Soo Chin Liew

Category: cs.CV

Published: 2025-07-19 (updated: 2025-11-11)

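💡 One-sentence takeaway

Proposes Multimodal Interactive Prompt Distillation (MIPD) to improve open-vocabulary grounded situation recognition.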

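🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 3: Spatial Perception and Semantics (Perception & Semantics); Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, situation recognition, knowledge distillation, open vocabulary, zero-shot learning, edge computing, deep learning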

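📋 Key points

Existing Multimodal Large Language Models (MLLMs) perform poorly on complex situation recognition, while conventional GSR models lack generalization ability and struggle with unseen and rare situations.
The paper proposes the Multimodal Interactive Prompt Distillation (MIPD) framework, which distills knowledge from a teacher MLLM to strengthen the generalization and zero-shot abilities of a student Ov-GSR model.
Evaluated on the Ov-SWiG dataset, MIPD achieves superior performance on seen, rare, and unseen situations, and it also improves unseen detection on the HICO-DET dataset.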

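📝 Abstract (translated summary)

Recent Multimodal Large Language Models (MLLMs) show strong zero-shot abilities, but they perform poorly on the complex Grounded Situation Recognition (GSR) task and are resource-intensive to deploy on edge devices. Conventional GSR models, in turn, usually lack generalization ability and struggle to recognize unseen and rare situations. By transferring knowledge from a teacher MLLM to a small GSR model, the paper introduces the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). It proposes the Multimodal Interactive Prompt Distillation (MIPD) framework to strengthen the generalization and zero-shot abilities of the student Ov-GSR model. MIPD uses an LLM-based Judgmental Rationales Generator (JRG) to build rationales rich in contextual semantic information, aligns them with visual information through the Negative-Guided Multimodal Prompting Alignment (NMPA) module, and finally distills the resulting multimodal knowledge into the student model, improving recognition of seen, rare, and unseen situations.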

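🔬 Method details

Problem definition: The paper targets the limited generalization of conventional GSR models when recognizing unseen and rare situations. MLLMs, the obvious alternative, perform poorly on complex situation recognition and are costly in resources.

Core idea: Use the Multimodal Interactive Prompt Distillation (MIPD) framework to extract enriched multimodal knowledge from a teacher MLLM and strengthen the student Ov-GSR model, especially on unseen and rare situations.

Technical framework: MIPD has three main stages. First, the Judgmental Rationales Generator (JRG) constructs positive and negative rationales that carry contextual semantic information. Second, the Negative-Guided Multimodal Prompting Alignment (NMPA) module aligns these rationales with visual information from the teacher MLLM. Finally, the aligned multimodal knowledge is distilled into the student model.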
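The alignment stage can be illustrated with a small sketch. This summary does not give the NMPA objective, so the code below swaps in a generic two-way contrastive loss as an assumption: a visual embedding is pulled toward a positive rationale embedding and pushed away from a negative one. The function names, the `temperature` parameter, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negative_guided_loss(visual, pos_rationale, neg_rationale, temperature=0.1):
    """Hypothetical alignment loss (an assumption, not the paper's NMPA loss).

    Returns -log of the softmax probability that the visual embedding
    matches the positive rationale rather than the negative one.
    """
    s_pos = cosine(visual, pos_rationale) / temperature
    s_neg = cosine(visual, neg_rationale) / temperature
    # Numerically stable log-sum-exp over the two similarity scores.
    m = max(s_pos, s_neg)
    log_z = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_z - s_pos


# Toy embeddings: the visual feature points the same way as the positive rationale.
visual = [1.0, 0.0]
aligned = negative_guided_loss(visual, [1.0, 0.0], [0.0, 1.0])
misaligned = negative_guided_loss(visual, [0.0, 1.0], [1.0, 0.0])
print(aligned < misaligned)
```

Under this sketch the loss is small when the visual feature sits closer to the positive rationale than to the negative one, which is the behaviour the word "negative-guided" suggests. The actual module may weight or combine the terms differently.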

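Key innovation: By introducing the Judgmental Rationales Generator and the negative-guided alignment module, MIPD captures holistic and perceptual multimodal knowledge, which improves the model's generalization.

Key design: The rationales produced by JRG enrich the contextual semantics, the NMPA module's alignment mechanism ties visual information to the rationales, and the distillation step transfers this knowledge to the student model. The loss functions and network architecture are described in detail in the paper.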
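Since the loss functions are deferred to the paper, the distillation step is sketched here with the standard temperature-scaled KL divergence from classic knowledge distillation. This is a stand-in and not necessarily the loss MIPD uses. The logits are assumed, for illustration only, to be scores over verb classes produced by the teacher and the student; `distillation_loss` and its `temperature` default are hypothetical names and values.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of temperature-scaled logits, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic knowledge-distillation loss, used as an assumption here;
    the paper's own objective is not given in this summary.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


teacher = [2.0, 0.5, -1.0]       # hypothetical teacher scores over three classes
close_student = [1.8, 0.6, -0.9]
far_student = [-1.0, 0.5, 2.0]
print(distillation_loss(teacher, close_student) < distillation_loss(teacher, far_student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it moves the student's predictions toward the teacher's.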

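📊 Experimental highlights

On the Ov-SWiG dataset, MIPD outperforms baseline methods on seen, rare, and unseen situations, with the improvement most pronounced on unseen situations (the exact margin is not stated here). It also improves unseen detection on the HICO-DET dataset, which supports the method's effectiveness.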

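🎯 Application scenarios

Potential applications include intelligent surveillance, autonomous driving, and robot interaction, where the method could improve a system's understanding of complex situations, particularly unseen and rare ones. Because the student model is small, the approach may also enable more efficient situation recognition on edge devices.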

📄 Abstract (original)

Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.
