HoliTom: Holistic Token Merging for Fast Video Large Language Models

Authors: Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, Huan Wang

Category: cs.CV

Published: 2025-05-27 (updated: 2025-10-10)

Note: code link: https://github.com/cokeshao/HoliTom

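💡 One-Sentence Takeaway

HoliTom: a holistic token merging method for fast video large language models

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: video large language models, token merging, model compression, computational efficiency, spatio-temporal redundancy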

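📋 Key Points

Existing video LLMs are computationally inefficient, mainly because of redundant video tokens; existing pruning methods either incur high overhead or ignore global temporal information.
HoliTom proposes a training-free holistic token merging framework that combines global redundancy-aware outer-LLM temporal segmentation with inner-LLM token similarity-based merging.
Experiments show that HoliTom substantially lowers computational cost while retaining 99.1% of the original performance, with a 2.28x reduction in TTFT and a 1.32x speedup in decoding throughput.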

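🔬 Method Details

Problem definition: When video LLMs process long videos, the large number of redundant visual tokens makes computation expensive. Existing inner-LLM pruning methods, which operate inside the LLM, incur intrinsic computational overhead in the shallow layers. Outer-LLM pruning methods, which operate before the LLM, mainly address redundancy within a single frame or a short temporal window and ignore global long-range temporal information, so they compress less than the video allows. The key question is therefore how to reduce video token redundancy efficiently while preserving video understanding ability.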

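Core idea: HoliTom combines the strengths of outer-LLM and inner-LLM pruning. It first performs coarse-grained token merging based on global temporal analysis, then performs fine-grained token merging based on token similarity, yielding efficient token compression. The approach aims to exploit the spatio-temporal redundancy of video fully and to lighten the LLM's computational load.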

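Technical framework: HoliTom has two main stages. 1) Outer-LLM pruning: the video is first divided into segments by global redundancy-aware temporal segmentation, and spatial-temporal merging is then applied within each segment to reduce the number of visual tokens. 2) Inner-LLM pruning: inside the LLM, tokens are merged based on token similarity to reduce the count further. The two stages work together to achieve holistic token compression.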
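
The outer-LLM stage can be illustrated with a minimal sketch. This is not the authors' algorithm: the greedy adjacent-frame segmentation, the threshold `tau`, and every function name below are assumptions made for illustration, and the paper's global redundancy-aware segmentation may work differently. The sketch only shows the general shape of "segment by temporal redundancy, then merge static tokens within each segment".

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def frame_similarity(f1, f2):
    """Mean cosine similarity of tokens at the same spatial position."""
    return sum(cosine(t1, t2) for t1, t2 in zip(f1, f2)) / len(f1)

def segment_frames(frames, tau=0.8):
    """Greedy stand-in for temporal segmentation (assumption): start a new
    segment whenever adjacent frames are less similar than tau."""
    segments = [[0]]
    for i in range(1, len(frames)):
        if frame_similarity(frames[i - 1], frames[i]) >= tau:
            segments[-1].append(i)
        else:
            segments.append([i])
    return segments

def merge_segment(frames, segment, tau=0.8):
    """Within one segment, average the tokens at a spatial position if they
    all stay similar to the first frame's token; otherwise keep them all."""
    merged = []
    first = frames[segment[0]]
    for p in range(len(first)):
        tokens = [frames[i][p] for i in segment]
        if all(cosine(first[p], t) >= tau for t in tokens):
            dim = len(first[p])
            merged.append([sum(t[d] for t in tokens) / len(tokens)
                           for d in range(dim)])
        else:
            merged.extend(tokens)
    return merged

def outer_llm_prune(frames, tau=0.8):
    """Segment the video, then merge static tokens inside each segment."""
    out = []
    for seg in segment_frames(frames, tau):
        out.extend(merge_segment(frames, seg, tau))
    return out
```

Here `frames` is a list of frames, each a list of token vectors in a fixed spatial order. Two identical frames followed by a different one would form two segments, and the static tokens of the first segment would collapse into one token per position.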

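Key innovation: HoliTom's main contribution is its holistic design. It considers outer-LLM and inner-LLM pruning together and makes the two stages work cooperatively. Compared with using either strategy alone, HoliTom exploits the spatio-temporal redundancy of video more effectively, reaching a higher compression rate with better performance. HoliTom is also training-free, so it adds no training cost.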

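Key design: In outer-LLM pruning, the temporal segmentation strategy identifies stretches of the video whose content is similar or repeated, so that their tokens can be merged. The spatial-temporal merging strategy then reduces the number of tokens within each segment while preserving key information. In inner-LLM pruning, a token similarity measure identifies similar tokens that can be merged. Specific parameter settings and architectural details are not given in this summary; refer to the paper for them.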
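
A minimal sketch of similarity-based merging inside the LLM follows. The selection rule is an assumption: the summary does not say how retained tokens are chosen, so the sketch takes a hypothetical per-token importance score (`scores`) and a hypothetical budget (`keep`), and folds every non-retained token into its most similar retained token instead of discarding it.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_by_similarity(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens (hypothetical criterion) and
    average each remaining token into its most similar kept token."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])      # preserve the original token order
    dropped = order[keep:]
    groups = {i: [tokens[i]] for i in kept}
    for j in dropped:
        # Assign the dropped token to the kept token it resembles most.
        target = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        groups[target].append(tokens[j])
    dim = len(tokens[0])
    return [[sum(t[d] for t in groups[i]) / len(groups[i]) for d in range(dim)]
            for i in kept]
```

Merging rather than dropping means the information in low-scoring tokens still contributes to the retained representation, which is the general motivation for merging over plain pruning.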

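📊 Experimental Highlights

HoliTom achieves a strong efficiency-performance trade-off on LLaVA-OneVision-7B: it reduces computational cost to 6.9% of the original FLOPs while retaining 99.1% of the original performance. It also delivers a 2.28x reduction in TTFT and a 1.32x speedup in decoding throughput, which shows the method's practical efficiency.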
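
To put these figures on a common scale, the short calculation below converts them into reduction factors and remaining fractions. It assumes that "2.28x reduction" means baseline TTFT divided by HoliTom TTFT, and that a 1.32x throughput speedup corresponds to per-token decoding time shrinking by the same factor; the function name is illustrative.

```python
def summarize_efficiency(flops_fraction=0.069, ttft_reduction=2.28,
                         decode_speedup=1.32):
    """Convert the reported figures into factors and fractions."""
    return {
        # FLOPs cut to 6.9% of baseline is roughly a 14.5x reduction.
        "flops_reduction_factor": 1 / flops_fraction,
        # Remaining TTFT as a fraction of baseline (roughly 0.44).
        "ttft_fraction": 1 / ttft_reduction,
        # Remaining per-token decoding time as a fraction (roughly 0.76).
        "decode_time_fraction": 1 / decode_speedup,
    }

if __name__ == "__main__":
    for name, value in summarize_efficiency().items():
        print(f"{name}: {value:.3f}")
```

The gap between the roughly 14.5x FLOPs reduction and the 2.28x TTFT reduction is a reminder that theoretical compute savings and measured latency are different quantities.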

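🎯 Application Scenarios

HoliTom can be applied wherever long videos must be processed, for example video surveillance, autonomous driving, video summarization, and video retrieval. By lowering computational cost, it can help video LLMs run on resource-constrained devices and process video more efficiently. In the future, HoliTom could be extended to other multimodal tasks such as video question answering and video generation.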

📄 Abstract (Original)

Video large language models (video LLMs) excel at video comprehension but face significant computational inefficiency due to redundant video tokens. Existing token pruning methods offer solutions. However, approaches operating within the LLM (inner-LLM pruning), such as FastV, incur intrinsic computational overhead in shallow layers. In contrast, methods performing token pruning before the LLM (outer-LLM pruning) primarily address spatial redundancy within individual frames or limited temporal windows, neglecting the crucial global temporal dynamics and correlations across longer video sequences. This leads to sub-optimal spatio-temporal reduction and does not leverage video compressibility fully. Crucially, the synergistic potential and mutual influence of combining these strategies remain unexplored. To further reduce redundancy, we introduce HoliTom, a novel training-free holistic token merging framework. HoliTom employs outer-LLM pruning through global redundancy-aware temporal segmentation, followed by spatial-temporal merging to reduce visual tokens by over 90%, significantly alleviating the LLM's computational burden. Complementing this, we introduce a robust inner-LLM token similarity-based merging approach, designed for superior performance and compatibility with outer-LLM pruning. Evaluations demonstrate our method's promising efficiency-performance trade-off on LLaVA-OneVision-7B, reducing computational costs to 6.9% of FLOPs while maintaining 99.1% of the original performance. Furthermore, we achieve a 2.28x reduction in Time-To-First-Token (TTFT) and a 1.32x acceleration in decoding throughput, highlighting the practical benefits of our integrated pruning approach for efficient video LLMs inference.
