Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Authors: Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

Categories: cs.AI, cs.CV

Published: 2026-06-05

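💡 One-Sentence Takeaway

Proposes a hierarchical semantic-constrained heterogeneous graph (HSCHG) for open-vocabulary audio-visual event localization.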

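🎯 Matched Area: Pillar 3: Spatial Perception & Semantics

Keywords: audio-visual event localization, open vocabulary, heterogeneous graph, multimodal learning, hierarchical semantic constraints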

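📋 Key Points

Existing audio-visual event localization methods lack supervision signals for unseen categories, which makes audio-visual consistency hard to maintain.
The paper proposes a hierarchical semantic-constrained heterogeneous graph that addresses this by building a heterogeneous hierarchical graph and adding bidirectional semantic constraints between levels.
Experiments show that the method outperforms existing methods on the OV-AVEL benchmark.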

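📝 Abstract (Summary)

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods mainly learn joint audio-visual representations in Euclidean space and face two main challenges. First, without supervision signals for unseen categories, it is hard to maintain audio-visual consistency across multiple temporal scales. Second, the absence of hierarchical constraints between segment-level and video-level semantics prevents the model from establishing semantic consistency across levels. To address these, the paper proposes the hierarchical semantic-constrained heterogeneous graph (HSCHG) framework. It builds a heterogeneous hierarchical graph containing audio and visual segment nodes and their corresponding video-level nodes, uses multi-directional temporal edges to capture complete temporal information within each modality, and adopts a dual-threshold filtering gated fusion strategy that introduces cross-modal information only when alignment confidence is high. It also introduces bidirectional semantic constraints between segment-level and video-level representations to achieve semantic consistency across levels. Experiments show that the method outperforms existing methods on the OV-AVEL benchmark.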

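🔬 Method Details

Problem definition: The paper targets audio-visual consistency in open-vocabulary audio-visual event localization. Because unseen categories come with no supervision signal, existing methods struggle to keep audio and visual predictions consistent across multiple temporal scales.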

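Core idea: The hierarchical semantic-constrained heterogeneous graph (HSCHG) strengthens audio-visual semantic consistency by building a heterogeneous hierarchical graph and adding bidirectional semantic constraints. The hierarchy ties segment-level and video-level semantics together so that each level constrains the other.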
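To make the graph structure concrete, here is a minimal Python sketch of how such a heterogeneous hierarchical graph might be laid out. Only the node types (audio and visual segment nodes plus video-level nodes) come from the abstract. The rest is an assumption for illustration: one video-level node per modality, "multi-directional" read as forward and backward edges between adjacent segments, and the hypothetical function name `build_hierarchical_graph` and edge labels.

```python
def build_hierarchical_graph(num_segments):
    """Return (nodes, edges) for a toy two-modality hierarchical graph.

    Nodes are tuples: (modality, segment_index) or (modality, "video").
    Edges are (source, target, edge_type) triples.
    All names and edge types here are illustrative assumptions.
    """
    nodes = []
    edges = []
    for modality in ("audio", "visual"):
        # Assumed: one video-level node per modality.
        video_node = (modality, "video")
        nodes.append(video_node)
        for t in range(num_segments):
            seg = (modality, t)
            nodes.append(seg)
            # Hierarchical edges in both directions between the two levels.
            edges.append((seg, video_node, "segment_to_video"))
            edges.append((video_node, seg, "video_to_segment"))
            # Assumed reading of "multi-directional": forward and backward
            # temporal edges between adjacent segments of the same modality.
            if t + 1 < num_segments:
                nxt = (modality, t + 1)
                edges.append((seg, nxt, "forward"))
                edges.append((nxt, seg, "backward"))
    return nodes, edges


nodes, edges = build_hierarchical_graph(3)
print(len(nodes), len(edges))
```

Keeping edge types as explicit labels mirrors the heterogeneous setting, where each edge type can be given its own message-passing weights.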

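Framework: The architecture builds a heterogeneous hierarchical graph containing audio and visual segment nodes and their corresponding video-level nodes. Multi-directional temporal edges capture temporal information within each modality, and a dual-threshold filtering gated fusion strategy introduces cross-modal information only when alignment confidence is high.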
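The summary does not spell out how the dual-threshold gate is computed, so the following is only a sketch under stated assumptions: alignment confidence is taken to be the cosine similarity between an audio and a visual segment feature, the gate is closed below a lower threshold, fully open above an upper threshold, and ramps linearly in between. The names `gate_from_confidence`, `fuse_segment`, `tau_low`, `tau_high`, and `mix` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def gate_from_confidence(conf, tau_low=0.3, tau_high=0.7):
    """Assumed dual-threshold gate: 0 below tau_low, 1 above tau_high,
    linear ramp in between."""
    if conf <= tau_low:
        return 0.0
    if conf >= tau_high:
        return 1.0
    return (conf - tau_low) / (tau_high - tau_low)


def fuse_segment(own, other, tau_low=0.3, tau_high=0.7, mix=0.5):
    """Add gated cross-modal information from `other` into `own`.

    With low alignment confidence the gate is 0 and `own` passes through
    unchanged, so unreliable cross-modal cues are filtered out.
    """
    g = gate_from_confidence(cosine(own, other), tau_low, tau_high)
    return [a + g * mix * b for a, b in zip(own, other)]


# Orthogonal features: the gate stays closed and nothing is mixed in.
print(fuse_segment([1.0, 0.0], [0.0, 1.0]))
# Identical features: the gate is fully open.
print(fuse_segment([1.0, 0.0], [1.0, 0.0]))
```

A learned gate would normally replace the fixed linear ramp; the fixed version is used here only to keep the thresholding behavior visible.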

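Key innovation: Bidirectional semantic constraints between segment-level and video-level representations enforce semantic consistency across levels, a hierarchical constraint that existing methods lack.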

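Key design: The multi-level audio-visual representations and the text prototypes are mapped uniformly into hyperbolic space, where a hierarchical entailment regularization loss characterizes the hierarchical relationship between videos and segments.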
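As a rough illustration of the hyperbolic step, the sketch below maps Euclidean features into the Poincaré ball with the standard exponential map at the origin and measures the Poincaré distance. The choice of the Poincaré ball model and the curvature parameter `c` are assumptions, since the summary does not say which hyperbolic model is used. The `norm_order_penalty` function is a simplified stand-in, not the paper's hierarchical entailment loss: it only encourages the more general video-level embedding to lie closer to the origin than a segment-level embedding.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball (curvature -c)."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * norm(x) ** 2) * (1.0 - c * norm(y) ** 2)
    return math.acosh(1.0 + 2.0 * c * diff2 / denom) / math.sqrt(c)


def norm_order_penalty(video, segment, margin=0.1):
    """Simplified stand-in for an entailment-style loss (an assumption):
    penalize a video-level point that is not closer to the origin than
    the segment-level point by at least `margin`."""
    return max(0.0, norm(video) - norm(segment) + margin)


segment = expmap0([0.3, 0.4])
video = expmap0([0.03, 0.04])
print(poincare_distance(video, segment))
print(norm_order_penalty(video, segment))
```

With `c = 1`, the distance from the origin to `expmap0(v)` equals twice the Euclidean norm of `v`, which gives a quick sanity check on the two functions.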

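📊 Experimental Highlights

Experiments show that the method outperforms existing methods on the OV-AVEL benchmark, and ablation studies further validate its effectiveness. The abstract does not report specific improvement figures.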

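🎯 Application Scenarios

The work applies to audio-visual event recognition and localization, particularly in intelligent surveillance, automated video analysis, and multimedia retrieval. Better recognition of unseen categories could make such systems more practical in real-world deployments.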

📄 Abstract (Original)

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic constrained heterogeneous graph (HSCHG) for audio-visual event localization framework. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
