From Data to Modeling: Fully Open-vocabulary Scene Graph Generation
Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen
Category: cs.CV
Published: 2025-05-26
💡 One-Sentence Takeaway
OvSGTR is proposed to enable fully open-vocabulary scene graph generation, addressing the fixed-vocabulary limitation of traditional methods.
🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 3: Spatial Perception & Semantics (Perception & Semantics); Pillar 9: Embodied Foundation Models
Keywords: open vocabulary, scene graph generation, transformer architecture, weakly supervised learning, relation-aware pre-training
📋 Key Points
- Existing scene graph generation methods are typically restricted to a fixed vocabulary and cannot accommodate newly emerging concepts, which limits their practical use.
- OvSGTR jointly predicts objects and relations with a DETR-like architecture and a relation-aware pre-training strategy, moving beyond the constraints of traditional closed-set models.
- On the VG150 benchmark, OvSGTR performs strongly in both closed-set and open-vocabulary settings, demonstrating its potential for visual understanding.
📝 Abstract (Translated)
We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict object and relationship recognition to a fixed vocabulary, limiting their applicability to real-world scenes where new concepts frequently appear. Our approach goes beyond predefined categories and jointly predicts objects (nodes) and their inter-relationships (edges). OvSGTR uses a DETR-like architecture with a frozen image backbone and text encoder to extract high-quality visual and semantic features, which a transformer decoder then fuses for end-to-end scene graph prediction. To strengthen the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Experimental results show that OvSGTR achieves state-of-the-art performance on the VG150 benchmark across multiple settings.
🔬 Method Details
Problem definition: The paper targets the limitations of conventional scene graph generation models in open-vocabulary settings; existing methods cannot handle newly appearing objects and relations, which restricts their range of application.
Core idea: OvSGTR jointly predicts objects and relations using a transformer architecture, combined with a weakly supervised, relation-aware pre-training strategy that strengthens the model's understanding of complex visual relations.
Technical framework: The overall architecture consists of a frozen image backbone and a frozen text encoder that extract visual and semantic features, followed by a transformer decoder that predicts the scene graph end to end, as sketched below.
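The sketch below (not the authors' released code) illustrates this kind of open-vocabulary, DETR-style pipeline in PyTorch: a frozen convolutional stand-in for the image backbone, learnable object queries decoded against image features, object classification by similarity to frozen text-encoder embeddings of category names, and pairwise query features scored against predicate-name embeddings. All module choices, dimensions, and heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OpenVocabSGG(nn.Module):
    """Toy open-vocabulary scene graph predictor in the spirit of OvSGTR."""

    def __init__(self, dim=256, num_queries=100):
        super().__init__()
        # Stand-in for a frozen image backbone (a real system would use a
        # pre-trained detector backbone such as Swin or ResNet).
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.queries = nn.Embedding(num_queries, dim)   # object queries = graph nodes
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.bbox_head = nn.Linear(dim, 4)               # one box per node
        self.rel_head = nn.Linear(2 * dim, dim)          # fused feature per node pair (edge)

    def forward(self, images, class_text_emb, predicate_text_emb):
        # images: (B, 3, H, W); *_text_emb: (num_names, dim) from a frozen text encoder.
        memory = self.backbone(images).flatten(2).transpose(1, 2)         # (B, tokens, dim)
        tgt = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        nodes = self.decoder(tgt, memory)                                 # (B, Q, dim)
        obj_logits = nodes @ class_text_emb.t()          # open-vocab object scores
        boxes = self.bbox_head(nodes).sigmoid()
        q = nodes.size(1)
        pairs = torch.cat([nodes.unsqueeze(2).expand(-1, -1, q, -1),
                           nodes.unsqueeze(1).expand(-1, q, -1, -1)], dim=-1)
        rel_logits = self.rel_head(pairs) @ predicate_text_emb.t()        # (B, Q, Q, num_preds)
        return obj_logits, boxes, rel_logits

model = OpenVocabSGG()
obj_logits, boxes, rel_logits = model(torch.randn(1, 3, 256, 256),
                                      torch.randn(150, 256),   # e.g. VG150 object names
                                      torch.randn(50, 256))    # e.g. predicate names
```

Scoring nodes and edges against text embeddings, rather than through fixed softmax heads, is what lets the vocabulary grow at inference time simply by supplying embeddings of new category or predicate names.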
Key innovation: The central contribution is an open-vocabulary scene graph generation method that combines a relation-aware pre-training strategy with a visual-concept retention mechanism, markedly improving performance on scenes containing novel concepts; the weak-supervision idea behind the pre-training data is illustrated after this paragraph.
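As a toy illustration of the weakly supervised annotation idea, the snippet below parses image captions into (subject, predicate, object) pseudo-triplets with a hand-written rule list, in the spirit of the scene-parser-based pipeline; the paper also investigates LLM-based and multimodal LLM-based pipelines for the same purpose. The predicate list and matching rules are purely illustrative, not the paper's parser.

```python
import re

# Illustrative predicate vocabulary; a real pipeline would use a proper scene parser.
PREDICATES = ["riding", "holding", "wearing", "sitting on", "standing on", "on", "near"]

def caption_to_triplets(caption):
    """Extract naive <subject, predicate, object> pseudo-triplets from a caption."""
    triplets = []
    for pred in PREDICATES:
        # Match "<article?> <subject> <predicate> <article?> <object>".
        m = re.search(rf"(?:a|an|the)?\s*(\w+)\s+{pred}\s+(?:a|an|the)?\s*(\w+)", caption)
        if m:
            triplets.append((m.group(1), pred, m.group(2)))
    return triplets

print(caption_to_triplets("a man riding a horse near the beach"))
# [('man', 'riding', 'horse'), ('horse', 'near', 'beach')]
```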
Key design: The model adopts a DETR-like architecture and combines several pipelines for generating supervision signals, ensuring that rich semantic cues are preserved during fine-tuning, and introduces a knowledge distillation strategy to counter catastrophic forgetting; a sketch of such a retention loss follows.
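A hedged sketch of how a distillation-style retention loss could look: the student's query-to-concept similarity distribution is pulled toward that of a frozen teacher (the pre-trained model), so fine-tuning on a smaller closed vocabulary does not erase concepts learned during pre-training. The exact loss form, temperature, and weighting below are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def retention_loss(student_feats, teacher_feats, text_emb, temperature=0.07):
    """Distill the teacher's feature-to-text similarity distribution into the student.

    student_feats, teacher_feats: (num_queries, dim) node features.
    text_emb: (num_concepts, dim) frozen text-encoder embeddings of concept names.
    """
    s = F.normalize(student_feats, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    with torch.no_grad():
        t = F.normalize(teacher_feats, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    # KL divergence between teacher and student similarity distributions.
    return F.kl_div(F.log_softmax(s / temperature, dim=-1),
                    F.softmax(t / temperature, dim=-1),
                    reduction="batchmean")

# During fine-tuning this term would be added to the detection and relation losses,
# e.g. total_loss = detection_loss + relation_loss + lambda_kd * retention_loss(...)
```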
📊 Experimental Highlights
On the VG150 benchmark, OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios, demonstrating its effectiveness and reliability for complex visual understanding.
🎯 Application Scenarios
Potential application areas include intelligent surveillance, autonomous driving, and virtual reality, where the method can improve a computer vision system's understanding of complex scenes and support practical deployment. Going forward, the approach may contribute to a broader range of visual understanding tasks and to building open-vocabulary intelligent systems.
📄 Abstract (Original)
We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.