SVHighlights: Towards Extremely Long Sport Video Highlight Detection
Authors: Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim
Categories: cs.CV, cs.MM
Published: 2026-06-05
Comments: Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/
💡 One-Sentence Summary
Introduces SVHighlights, a benchmark for highlight detection in extremely long sports videos.
🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: long video understanding, highlight detection, multimodal fusion, video segmentation, dataset construction
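📋 Key Points
- Existing highlight detection methods mainly target short videos. Models trained on short clips fail to generalize to hour-long content.
- The paper introduces the SVHighlights benchmark and the TF-SELECTOR method, which divides each video into context-aware segments to improve highlight detection.
- In the reported experiments, TF-SELECTOR improves on VTG-tuned baselines by +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU.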
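📝 Abstract (Translated Summary)
Highlight detection for long videos is of great practical importance, but existing methods are mostly limited to short-form content and lack a suitable benchmark. To address this, the authors introduce SVHighlights, which they describe as the first highlight detection benchmark for long sports videos exceeding one hour. SVHighlights is built from pairs of full-length sports videos and their corresponding official highlight videos, using a dataset generation pipeline that makes label generation scalable. The benchmark contains 320 videos with an average duration of 2.00 hours and a total duration of 640.18 hours, substantially exceeding previous datasets. To handle the challenges of long videos, the authors propose TF-SELECTOR, a training-free, segment-based method that merges adjacent shots into context-aware segments and uses a large language model to predict segment-level saliency scores. Experiments show that TF-SELECTOR outperforms VTG-tuned baselines on most metrics.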
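🔬 Method Details
Problem definition: The paper addresses highlight detection in long videos. Existing methods perform poorly on videos longer than one hour and fail to identify highlight content reliably.
Core idea: Introduce the SVHighlights benchmark together with TF-SELECTOR, which uses context-aware segmentation to improve highlight detection on long videos and avoids the limitations of training on short clips.
Technical framework: TF-SELECTOR divides each video into context-aware segments, feeds multimodal inputs (visual captions, transcripts, and audio volume) to a large language model, and predicts a saliency score for each segment.
Key innovation: SVHighlights is presented as the first highlight detection benchmark for extremely long sports videos. TF-SELECTOR combines segment-level partitioning with multimodal inputs to improve highlight detection on long videos.
Key design: TF-SELECTOR requires no training. It scores saliency at the segment level and merges adjacent shots that share the same semantic content so that each segment keeps its context intact. The abstract does not detail specific parameter settings (and, as the method is training-free, no loss function is described); see the full paper.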
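The segment-then-score idea described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the `Shot` and `Segment` types, the cosine-similarity merge rule, the `threshold` value, the prompt layout in `describe`, and the `score_fn` callable (a stand-in for the large language model) are all assumptions introduced here, since the abstract does not specify them.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Shot:
    start: float                 # shot start time in seconds
    end: float                   # shot end time in seconds
    embedding: Sequence[float]   # hypothetical semantic embedding of the shot
    caption: str = ""            # visual caption for the shot
    transcript: str = ""         # speech transcript for the shot
    volume: float = 0.0          # hypothetical mean audio volume


@dataclass
class Segment:
    start: float
    end: float
    shots: List[Shot]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_shots(shots: List[Shot], threshold: float = 0.8) -> List[Segment]:
    """Merge each shot into the previous segment when it is similar enough
    to that segment's last shot (assumed rule); otherwise start a new one."""
    segments: List[Segment] = []
    for shot in shots:
        if segments and cosine(segments[-1].shots[-1].embedding,
                               shot.embedding) >= threshold:
            segments[-1].shots.append(shot)
            segments[-1].end = shot.end
        else:
            segments.append(Segment(shot.start, shot.end, [shot]))
    return segments


def describe(segment: Segment) -> str:
    """Build a text description of a segment from its multimodal inputs
    (assumed prompt layout)."""
    captions = " ".join(s.caption for s in segment.shots if s.caption)
    transcript = " ".join(s.transcript for s in segment.shots if s.transcript)
    volume = sum(s.volume for s in segment.shots) / len(segment.shots)
    return (f"Time: {segment.start:.1f}-{segment.end:.1f}s\n"
            f"Captions: {captions}\n"
            f"Transcript: {transcript}\n"
            f"Mean volume: {volume:.2f}")


def top_k_segments(segments: List[Segment],
                   score_fn: Callable[[str], float],
                   k: int) -> List[Segment]:
    """Score every segment with score_fn (stand-in for an LLM call) and
    return the k highest-scoring segments."""
    scored = [(score_fn(describe(seg)), seg) for seg in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:k]]
```

In practice `score_fn` would wrap a call to a language model that returns a saliency score for the described segment; passing it in as a callable keeps the sketch independent of any particular model or API.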
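📊 Experimental Highlights
TF-SELECTOR outperforms existing VTG-tuned baselines on most metrics, with improvements of +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU, supporting its effectiveness for long-video highlight detection.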
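For readers unfamiliar with these metrics, the sketch below shows one common way to compute a temporal IoU and a hit-at-K score over time intervals. The abstract does not define HIT@1, HIT@K, or IoU precisely, so the definitions here are assumptions: IoU is taken as overlap over union of the predicted and ground-truth highlight durations, and a "hit" at K means that at least one of the K top-ranked predicted segments overlaps a ground-truth highlight. The benchmark's actual definitions may differ.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def merge_intervals(intervals: Sequence[Interval]) -> List[List[float]]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[List[float]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def total_length(intervals: Sequence[Interval]) -> float:
    """Total covered duration, counting overlapping regions once."""
    return sum(end - start for start, end in merge_intervals(intervals))


def intersection_length(a: Sequence[Interval], b: Sequence[Interval]) -> float:
    """Duration covered by both interval sets (two-pointer sweep)."""
    a_m, b_m = merge_intervals(a), merge_intervals(b)
    i = j = 0
    total = 0.0
    while i < len(a_m) and j < len(b_m):
        lo = max(a_m[i][0], b_m[j][0])
        hi = min(a_m[i][1], b_m[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a_m[i][1] < b_m[j][1]:
            i += 1
        else:
            j += 1
    return total


def temporal_iou(pred: Sequence[Interval], gt: Sequence[Interval]) -> float:
    """Assumed definition: overlap duration divided by union duration."""
    inter = intersection_length(pred, gt)
    union = total_length(pred) + total_length(gt) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked_pred: Sequence[Interval], gt: Sequence[Interval],
             k: int) -> float:
    """Assumed definition: 1.0 if any of the top-k predictions overlaps
    a ground-truth highlight, else 0.0."""
    return float(any(intersection_length([p], gt) > 0
                     for p in ranked_pred[:k]))
```

For example, with predictions `[(0, 10), (20, 30)]` and ground truth `[(5, 15), (25, 35)]`, the overlap is 10 seconds and the union is 30 seconds, so this IoU is 1/3.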
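🎯 Application Scenarios
Potential applications include sports event replays, video summarization, and content recommendation systems. Accurately identifying the highlights in long videos can improve the viewing experience and give content creators more efficient editing tools.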
📄 Abstract (Original)
While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +3.12 in HIT@1, +4.06 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.