DisTime: Distribution-based Time Representation for Video Large Language Models

Authors: Yingsen Zeng, Zepeng Huang, Yujie Zhong, Chengjian Feng, Jie Hu, Lin Ma, Yang Liu

Category: cs.CV

Published: 2025-05-30 (updated: 2025-07-31)

Note: Accepted by ICCV 2025

🔗 Code/Project: https://github.com/josephzpng/DisTime

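💡 One-Sentence Summary

DisTime: a distribution-based time representation for video large language models.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: video large language models, temporal representation learning, temporal grounding, probability distributions, automated annotation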

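📋 Key Points

Existing Video-LLMs fall short on temporal localization; discrete time representations and a lack of temporally aware data are the main bottlenecks.
DisTime builds a continuous temporal embedding from a learnable token and uses a distribution-based decoder to generate temporal probability distributions, which mitigates boundary ambiguity.
The paper proposes an automated annotation paradigm, builds the large-scale temporally grounded dataset InternVid-TG, and reports state-of-the-art results on multiple time-sensitive tasks.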

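📝 Abstract (Translated)

Although general video understanding has progressed, Video Large Language Models (Video-LLMs) still struggle with precise temporal localization, which is attributed to discrete time representations and limited temporally aware datasets. Existing ways of expressing time either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time with specialized temporal grounding heads. To address these issues, the authors introduce DisTime, a lightweight framework for strengthening temporal understanding in Video-LLMs. DisTime uses a learnable token to create a continuous temporal embedding space and pairs it with a Distribution-based Time Decoder that generates temporal probability distributions, mitigating boundary ambiguity and preserving temporal continuity. In addition, a Distribution-based Time Encoder re-encodes timestamps to provide time markers for the Video-LLM. To overcome the limited temporal granularity of existing datasets, the authors propose an automated annotation paradigm that combines the captioning ability of Video-LLMs with the localization expertise of dedicated temporal models. The result is InternVid-TG, a large dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments show that DisTime achieves state-of-the-art performance on benchmarks for three time-sensitive tasks while remaining competitive on Video QA tasks.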

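🔬 Method Details

Problem definition: Video-LLMs perform poorly on video understanding tasks that require temporal localization. Existing methods either treat time as text-based numbers or represent it with discrete tokens, so they cannot capture the continuity of time or the ambiguity of event boundaries. The lack of large-scale, high-quality temporally annotated datasets further limits model performance.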

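Core idea: DisTime represents time as a continuous probability distribution rather than as discrete numbers or tokens. By learning a continuous temporal embedding space and predicting a temporal probability distribution with a distribution-based decoder, the model can handle ambiguous event boundaries and capture continuous change over time. Separately, the captioning ability of a Video-LLM and the localization ability of a dedicated temporal model are combined to generate large-scale temporal annotations automatically.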
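To make the idea concrete, the sketch below shows one way a probability distribution over time bins can be decoded into a continuous timestamp by taking its expected value, and how the spread of that distribution reflects boundary ambiguity. The evenly spaced bin anchors, the function names, and the logits are illustrative assumptions, not the paper's implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_anchors(num_bins, duration):
    """Assumption: bin i is anchored at i / (K - 1) of the video duration."""
    return [i / (num_bins - 1) * duration for i in range(num_bins)]


def expected_time(probs, duration):
    """Decode a distribution over bins into a continuous time in seconds."""
    anchors = bin_anchors(len(probs), duration)
    return sum(p * a for p, a in zip(probs, anchors))


def spread(probs, duration):
    """Standard deviation of the distribution in seconds."""
    anchors = bin_anchors(len(probs), duration)
    mean = sum(p * a for p, a in zip(probs, anchors))
    var = sum(p * (a - mean) ** 2 for p, a in zip(probs, anchors))
    return math.sqrt(var)


# Hypothetical logits for a 5-bin distribution over a 60-second video.
sharp = softmax([0.0, 0.0, 8.0, 0.0, 0.0])  # confident boundary
fuzzy = softmax([0.0, 2.0, 3.0, 2.0, 0.0])  # ambiguous boundary

print(expected_time(sharp, 60.0), spread(sharp, 60.0))
print(expected_time(fuzzy, 60.0), spread(fuzzy, 60.0))
```

Both distributions are symmetric around the middle bin, so they decode to the same timestamp; only the spread differs, which is the extra information a single regressed number or a single discrete token would not carry.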

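Technical framework: DisTime has three main components: 1) Distribution-based Time Encoder: re-encodes timestamps into time markers suitable for the Video-LLM. 2) Learnable Temporal Token Embedding: a learnable token whose embedding lies in a continuous temporal embedding space. 3) Distribution-based Time Decoder: generates a temporal probability distribution from the time embedding, which is used for temporal localization. The whole framework can be trained end to end.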
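The encoder side can be illustrated with a small round trip. This summary does not specify how the Distribution-based Time Encoder works internally, so the scheme below (splitting a timestamp between its two nearest bin anchors by linear interpolation) is an assumed stand-in, and `encode_timestamp` and `decode_distribution` are hypothetical names.

```python
def encode_timestamp(t, duration, num_bins):
    """Spread a timestamp over its two nearest bin anchors (assumed scheme)."""
    if num_bins < 2 or duration <= 0:
        raise ValueError("need at least two bins and a positive duration")
    if not 0.0 <= t <= duration:
        raise ValueError("timestamp must lie within the video")
    pos = t / duration * (num_bins - 1)  # fractional bin index
    left = min(int(pos), num_bins - 2)   # index of the left anchor
    frac = pos - left                    # weight given to the right anchor
    probs = [0.0] * num_bins
    probs[left] = 1.0 - frac
    probs[left + 1] = frac
    return probs


def decode_distribution(probs, duration):
    """Expected value of the distribution, mapped back to seconds."""
    k = len(probs)
    return sum(p * i / (k - 1) for i, p in enumerate(probs)) * duration


# Round trip for a timestamp at 12.5 s in a 60 s video with 33 bins.
dist = encode_timestamp(12.5, duration=60.0, num_bins=33)
recovered = decode_distribution(dist, duration=60.0)
print(recovered)
```

In a real model the resulting distribution would typically be projected into the language model's embedding space before being used as a time marker; that projection is omitted here.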

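Key innovation: The main contribution of DisTime is its distribution-based time representation. Compared with conventional discrete time representations, it handles ambiguous event boundaries better and captures continuous change over time. In addition, the automated annotation paradigm addresses the difficulty of obtaining temporal annotations at scale.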
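The automated annotation paradigm pairs a Video-LLM that writes event captions with a dedicated temporal model that localizes each caption. A minimal sketch of that control flow follows; both models are passed in as hypothetical callables, the `GroundedEvent` record is an invented container, and the filter that drops empty spans is an added assumption rather than a step described in the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GroundedEvent:
    video_id: str
    caption: str
    start: float  # seconds
    end: float    # seconds


def annotate_video(
    video_id: str,
    caption_events: Callable[[str], List[str]],
    ground_event: Callable[[str, str], Tuple[float, float]],
) -> List[GroundedEvent]:
    """Caption a video, then localize each caption in time."""
    events = []
    for caption in caption_events(video_id):
        start, end = ground_event(video_id, caption)
        if end > start:  # assumed sanity filter: drop degenerate spans
            events.append(GroundedEvent(video_id, caption, start, end))
    return events


# Stub models standing in for a Video-LLM captioner and a grounding model.
def fake_captioner(video_id):
    return ["a person opens a door", "a dog runs outside"]


def fake_grounder(video_id, caption):
    spans = {
        "a person opens a door": (1.0, 4.5),
        "a dog runs outside": (5.0, 5.0),  # degenerate span, filtered out
    }
    return spans[caption]


events = annotate_video("vid_0001", fake_captioner, fake_grounder)
for event in events:
    print(event)
```

Keeping the two models behind plain callables makes it easy to swap in different captioners or grounding models without touching the pipeline logic.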

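Key design: The Distribution-based Time Decoder uses a multilayer perceptron (MLP) to map the time embedding to a temporal probability distribution. The loss function is a cross-entropy loss, which encourages the predicted temporal distribution to match the ground-truth temporal distribution as closely as possible. In the automated annotation paradigm, a Video-LLM generates video descriptions, and a temporal model then aligns the temporal information in those descriptions with video segments to produce the temporal annotations.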
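The decoder and loss described above can be sketched in a few lines of dependency-free Python. Layer sizes, random weights, the input vector, and the soft target are all hypothetical placeholders chosen for illustration; the sketch shows only the forward pass and the loss value, not training.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_decoder(embedding, params):
    """Two-layer MLP: embedding -> ReLU hidden -> distribution over time bins."""
    hidden = [max(0.0, h) for h in linear(embedding, params["w1"], params["b1"])]
    return softmax(linear(hidden, params["w2"], params["b2"]))


def cross_entropy(pred, target, eps=1e-12):
    """Cross-entropy between a predicted and a (possibly soft) target distribution."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))


def init_params(in_dim, hidden_dim, num_bins, seed=0):
    """Random placeholder weights; a trained model would learn these."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": rand_matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": rand_matrix(num_bins, hidden_dim), "b2": [0.0] * num_bins,
    }


params = init_params(in_dim=4, hidden_dim=8, num_bins=5)
time_embedding = [0.2, -0.1, 0.7, 0.3]  # hypothetical hidden state of the time token
target = [0.0, 0.0, 0.6, 0.4, 0.0]      # soft target between bins 2 and 3
pred = mlp_decoder(time_embedding, params)
loss = cross_entropy(pred, target)
print(pred, loss)
```

With a soft target, the loss is minimized when the prediction equals the target distribution, so the decoder is pushed to reproduce the shape of the target rather than a single hard bin.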

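📊 Experimental Highlights

DisTime achieves state-of-the-art performance on benchmarks for multiple time-sensitive tasks. On temporal grounding, for example, it is reported to outperform existing methods by a clear margin. In addition, the InternVid-TG dataset built with the automated annotation paradigm is far larger than existing datasets, which gives Video-LLMs much more supervision for temporal understanding.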

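🎯 Application Scenarios

Potential applications of DisTime include video content analysis, video retrieval, intelligent surveillance, and autonomous driving. Better temporal understanding in Video-LLMs can enable more precise localization of video events, smarter video content recommendation, and more reliable video surveillance systems. The work helps advance video understanding and offers practical value for these applications.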

📄 Abstract (Original)

Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks. Code and data are released at https://github.com/josephzpng/DisTime.
