STORM: Token-Efficient Long Video Understanding for Multimodal LLMs

Authors: Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon

Category: cs.CV

Published: 2025-03-06 (updated: 2025-09-22)

💡 One-Sentence Takeaway

STORM addresses the lack of explicit temporal modeling in long video understanding.

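🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 8: Physics-based Animation; Pillar 9: Embodied Foundation Models

Keywords: long video understanding, multimodal LLMs, temporal encoder, dynamic pattern capture, computational efficiency, video reasoning, Mamba state space model, token reduction strategies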

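📋 Key Points

Existing methods for long videos often treat each frame independently and lack effective temporal modeling, which limits their ability to capture dynamic patterns.
STORM adds a dedicated temporal encoder, built on the Mamba state space model, that injects temporal information into the image tokens and yields richer video representations.
STORM improves results by more than 5% on the long video benchmarks MLVU and LongVideoBench, while cutting computation cost by up to 8x and decoding latency by 2.4-2.9x for a fixed number of input frames.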

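📝 Abstract (Translated from Chinese)

Video-based multimodal large language models (Video-LLMs) have made notable progress in video understanding in recent years, but many existing methods process frames independently in the vision backbone and lack explicit temporal modeling, which limits their ability to capture dynamic patterns. To address these limitations, the paper proposes STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture that places a dedicated temporal encoder between the image encoder and the LLM. The temporal encoder uses the Mamba state space model to integrate temporal information into the image tokens, producing enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only improves video reasoning but also enables effective token reduction strategies, substantially lowering the computational demands on the LLM while retaining key temporal information. Extensive evaluations show that STORM achieves state-of-the-art results on several long video understanding benchmarks while reducing computation cost by up to 8x and decoding latency by 2.4-2.9x.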

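🔬 Method Details

Problem definition: The paper targets the lack of effective temporal modeling in existing long video understanding methods. These methods usually process video frames independently, so they cannot fully capture the dynamic relationships between frames, which limits their performance on long videos.

Core idea: STORM introduces a dedicated temporal encoder that uses the Mamba state space model to integrate temporal information into the image tokens, producing richer representations. The design aims to strengthen temporal understanding of the video and thereby improve the model's reasoning.

Technical framework: The overall architecture has three main modules: an image encoder, a temporal encoder, and a multimodal large language model (LLM). The image encoder extracts features from each frame, the temporal encoder integrates those features along the time axis, and the resulting tokens are passed to the LLM for reasoning.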
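To make the role of the temporal encoder concrete, the following is a minimal sketch of temporal mixing over image tokens. It is not the Mamba selective scan used in the paper: it assumes a much simpler fixed-decay linear recurrence as a stand-in, and the function name `temporal_scan`, the `decay` parameter, and the tensor sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def temporal_scan(tokens: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """Toy causal linear recurrence over the frame axis.

    tokens: array of shape (T, N, D) with T frames, N image tokens per
            frame, and D channels (output of a per-frame image encoder).
    Returns an array of the same shape in which frame t mixes frames 0..t.

    Assumption: a fixed scalar decay replaces the input-dependent state
    updates of a real Mamba block.
    """
    if not 0.0 <= decay < 1.0:
        raise ValueError("decay must be in [0, 1)")
    state = np.zeros_like(tokens[0], dtype=np.float64)
    out = np.empty(tokens.shape, dtype=np.float64)
    for t in range(tokens.shape[0]):
        # h_t = decay * h_{t-1} + (1 - decay) * x_t
        state = decay * state + (1.0 - decay) * tokens[t]
        out[t] = state
    return out


# Hypothetical sizes: 8 frames, 16 tokens per frame, 32 channels.
rng = np.random.default_rng(0)
frame_tokens = rng.standard_normal((8, 16, 32))
mixed_tokens = temporal_scan(frame_tokens)
print(mixed_tokens.shape)
```

After the scan, each token carries information from earlier frames at the same spatial position, which is the property that lets later frames be dropped or pooled without discarding all of their content.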

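Key innovation: STORM's main contribution is the dedicated temporal encoder, which uses the Mamba state space model to integrate temporal information. Unlike conventional methods that encode frames independently, STORM preserves inter-frame dynamics when processing long videos, which improves understanding.

Key design: STORM uses several token reduction strategies, including test-time sampling and training-based temporal and spatial pooling. These strategies lower the computational demands on the LLM while retaining important temporal information, keeping inference efficient.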
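The three reduction strategies named above can be sketched on a token tensor of shape (T, N, D). This is an illustrative sketch only: it assumes average pooling and uniform frame striding, which the summary does not specify, and the function names, window sizes, and tensor sizes are hypothetical.

```python
import numpy as np


def temporal_pool(tokens: np.ndarray, window: int) -> np.ndarray:
    """Average every `window` consecutive frames: (T, N, D) -> (T // window, N, D)."""
    t, n, d = tokens.shape
    if t % window != 0:
        raise ValueError("T must be divisible by window")
    return tokens.reshape(t // window, window, n, d).mean(axis=1)


def spatial_pool(tokens: np.ndarray, grid: int, window: int) -> np.ndarray:
    """Average window x window patches of a grid x grid token layout.

    (T, grid * grid, D) -> (T, (grid // window) ** 2, D), row-major layout assumed.
    """
    t, n, d = tokens.shape
    if n != grid * grid or grid % window != 0:
        raise ValueError("token count must equal grid**2 and grid must be divisible by window")
    g = grid // window
    patches = tokens.reshape(t, g, window, g, window, d)
    return patches.mean(axis=(2, 4)).reshape(t, g * g, d)


def temporal_sample(tokens: np.ndarray, stride: int) -> np.ndarray:
    """Training-free reduction: keep every `stride`-th frame at test time."""
    return tokens[::stride]


# Hypothetical sizes: 32 frames, a 16 x 16 token grid, 64 channels.
video_tokens = np.zeros((32, 256, 64))
reduced = spatial_pool(temporal_pool(video_tokens, window=4), grid=16, window=2)
before = video_tokens.shape[0] * video_tokens.shape[1]
after = reduced.shape[0] * reduced.shape[1]
print(before, after)  # 8192 512
```

In STORM the reduction is applied to tokens that have already passed through the temporal encoder, so the surviving tokens are meant to carry information from the frames that were pooled or skipped. The sketch operates on raw arrays and does not model that step.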

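📊 Experimental Highlights

STORM achieves state-of-the-art results on several long video understanding benchmarks, with improvements of more than 5% on MLVU and LongVideoBench. It also reduces computation cost by up to 8x and decoding latency by 2.4-2.9x, improving both efficiency and accuracy.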

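🎯 Application Scenarios

STORM has potential applications in areas such as video surveillance, autonomous driving, and smart homes. By improving the efficiency and accuracy of long video understanding, it could provide stronger support for real-time video analysis. Its lower compute cost may also make it a practical option for resource-constrained devices.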

📄 Abstract (Original)

Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to $8\times$ and the decoding latency by 2.4-2.9$\times$ for the fixed numbers of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm
