RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs

Authors: Hongliang Li, Jiaxin Zhang, Wenhui Liao, Dezhi Peng, Kai Ding, Lianwen Jin

Categories: cs.CV

Published: 2025-01-31 (updated: 2025-05-30)

Comments: ACL 2025 Findings

🔗 Code/Project: https://github.com/L-Hugh/RedundancyLens

💡 One-sentence takeaway

RedundancyLens reveals and exploits redundancy in visual token processing to make decoder-only MLLMs more efficient.

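🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal large language models, visual token processing, model compression, inference acceleration, training-free optimization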

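📋 Key points

Decoder-only MLLMs achieve strong performance, but they process visual tokens inefficiently and spend substantial computation redundantly, which calls for optimization.
RedundancyLens is a training-free framework that uses Probe-Activated Dynamic FFN and Hollow Attention to analyze and reduce the computation spent on visual tokens.
Experiments show substantial redundancy in decoder-only MLLMs; used as an acceleration method, the framework performs comparably to or better than state-of-the-art methods while remaining compatible with them.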

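🔬 Method in detail

Problem definition: A decoder-only MLLM applies self-attention and FFN computation to every visual token, which is computationally expensive and inefficient. Existing approaches either trade performance for efficiency (as cross-attention architectures do) or leave the redundancy in decoder-only visual token processing largely unexploited. The key question is therefore how to cut the cost of processing visual tokens in a decoder-only MLLM without degrading performance.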
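To make the cost argument concrete, here is a rough back-of-the-envelope sketch of how much of one decoder layer's computation is attributable to visual tokens. The formulas assume a standard Transformer layer with an ungated two-matrix FFN, and the token counts and dimensions (576 visual tokens, 64 text tokens, `d_model=4096`, `d_ff=11008`) are illustrative assumptions, not figures from the paper.

```python
def layer_macs(n_tokens: int, d_model: int, d_ff: int) -> int:
    """Approximate multiply-accumulate count for one decoder layer.

    Assumes a standard Transformer layer; real models (gated FFNs,
    grouped-query attention, KV caching) will differ in the constants.
    """
    projections = 4 * n_tokens * d_model * d_model   # Q, K, V and output projections
    attention = 2 * n_tokens * n_tokens * d_model    # QK^T plus weighted sum over V
    ffn = 2 * n_tokens * d_model * d_ff              # up- and down-projection
    return projections + attention + ffn


def visual_share(n_visual: int, n_text: int, d_model: int, d_ff: int) -> float:
    """Fraction of per-layer compute that disappears if visual tokens are not processed."""
    full = layer_macs(n_visual + n_text, d_model, d_ff)
    text_only = layer_macs(n_text, d_model, d_ff)
    return 1.0 - text_only / full


# Hypothetical configuration, chosen only for illustration.
share = visual_share(n_visual=576, n_text=64, d_model=4096, d_ff=11008)
print(f"share of layer compute tied to visual tokens: {share:.2%}")
```

When visual tokens far outnumber text tokens, as in this hypothetical setting, nearly all of the per-layer cost comes from processing them, which is why reducing that processing is an attractive target.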

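Core idea: Reveal and exploit the redundancy in how decoder-only MLLMs process visual tokens. By analyzing an already-trained MLLM, the method identifies which visual token computations can be reduced or skipped altogether, lowering total computation and improving efficiency. It requires no retraining and applies directly to existing decoder-only MLLMs.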

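Framework: The framework has three components: 1) Probe-Activated Dynamic FFN, which activates FFN layers dynamically to avoid unnecessary computation; 2) Hollow Attention, which reduces attention computation by focusing on the more important tokens; and 3) a Layer Ranking Algorithm, which identifies the layers that affect performance least so that they are reduced first. The whole pipeline analyzes and optimizes an already-trained MLLM, with no retraining.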
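The layer-prioritization step can be illustrated with a simple stand-in. This summary does not spell out the paper's actual Layer Ranking Algorithm, so the sketch below uses plain single-layer ablation: apply the reduction to one layer at a time, measure the score drop, and rank layers from least to most sensitive. The `evaluate` callback and the toy sensitivities are hypothetical.

```python
from typing import Callable, FrozenSet, List


def rank_layers(num_layers: int,
                evaluate: Callable[[FrozenSet[int]], float]) -> List[int]:
    """Order layers from least to most sensitive to computation reduction.

    `evaluate(reduced)` is a hypothetical callback that returns a validation
    score when the reduction is applied to the layers in `reduced`.
    """
    baseline = evaluate(frozenset())
    drops = []
    for layer in range(num_layers):
        score = evaluate(frozenset({layer}))
        drops.append((baseline - score, layer))
    drops.sort()  # smallest drop first, so those layers are reduced first
    return [layer for _, layer in drops]


def toy_evaluate(reduced: FrozenSet[int]) -> float:
    """Toy stand-in for a benchmark run: each reduced layer costs a fixed amount."""
    sensitivity = [0.30, 0.05, 0.01, 0.20]  # made-up per-layer penalties
    return 1.0 - sum(sensitivity[i] for i in reduced)


order = rank_layers(4, toy_evaluate)
print(order)
```

With a ranking like this, reductions can be applied to the first k layers in `order`, and k becomes the knob that trades computation against accuracy.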

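Key innovation: The paper proposes a training-free framework that analyzes and exploits the redundancy in decoder-only MLLMs' visual token processing. Unlike existing methods, it needs no retraining and can be applied directly to existing models, which makes it more flexible and practical. In addition, Probe-Activated Dynamic FFN and Hollow Attention allow fine-grained control over how much computation is removed, which limits the effect on performance.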

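Key design: Probe-Activated Dynamic FFN uses a probe network to predict the activation probability of an FFN layer and activates the layer only when that probability exceeds a threshold. Hollow Attention applies a threshold to attention weights, attending only to tokens with high weights and ignoring those with low weights. The Layer Ranking Algorithm measures each layer's effect on performance to decide which layers to reduce first. All three aim to cut as much computation as possible while preserving performance.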
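The two gating mechanisms, as this summary describes them, can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the `probe` callback, the residual formulation, and the fallback when no weight survives the threshold are all hypothetical, and the paper's actual mechanisms may differ in detail.

```python
import math
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def thresholded_attention(weights: Sequence[float], threshold: float) -> List[float]:
    """Drop attention weights below `threshold` and renormalise the rest."""
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:  # assumed fallback: keep the original weights
        return list(weights)
    return [w / total for w in kept]


def gated_ffn(x: Sequence[float],
              ffn: Callable[[Sequence[float]], List[float]],
              probe: Callable[[Sequence[float]], float],
              threshold: float) -> List[float]:
    """Apply `ffn` with a residual connection only if the probe score exceeds `threshold`."""
    if probe(x) <= threshold:
        return list(x)  # FFN skipped: only the residual path remains
    return [xi + fi for xi, fi in zip(x, ffn(x))]


# Toy usage with made-up numbers.
weights = softmax([2.0, 0.1, -1.0, 1.5])
sparse = thresholded_attention(weights, threshold=0.2)
active = gated_ffn([1.0, 2.0], lambda v: [2 * t for t in v], lambda v: 0.9, 0.5)
skipped = gated_ffn([1.0, 2.0], lambda v: [2 * t for t in v], lambda v: 0.1, 0.5)
```

Both thresholds act as tunable knobs: raising them removes more computation at a higher risk of accuracy loss, which is the kind of adjustable reduction the framework is built around.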

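📊 Experimental highlights

The experiments show that the method substantially reduces the computation decoder-only MLLMs spend on visual tokens without degrading performance, and in some cases it outperforms state-of-the-art methods. For example, at the same performance level, the method speeds up inference by XX%.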

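🎯 Application scenarios

The results apply to settings that need efficient visual processing, such as multimodal applications on mobile devices, real-time video analysis, and intelligent systems in resource-constrained environments. Lowering the computational cost of MLLMs makes them easier to deploy on edge devices and therefore usable in a wider range of applications.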

📄 Abstract (original)

Current Multimodal Large Language Model (MLLM) architectures face a critical tradeoff between performance and efficiency: decoder-only architectures achieve higher performance but lower efficiency, while cross-attention-based architectures offer greater efficiency but lower performance. The key distinction lies in how visual tokens are processed. Decoder-only architectures apply self-attention and FFN operations on visual tokens, while cross-attention architectures skip these computations. To investigate whether redundancy exists in this computationally expensive process, we propose a training-free framework for analyzing trained MLLMs. It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens, as well as a Layer Ranking Algorithm that prioritizes layers for these reductions. Extensive experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs, offering valuable insights for future MLLM architecture design. Furthermore, by leveraging our reduction framework as a training-free inference acceleration approach, we achieve performance comparable to or better than state-of-the-art methods while remaining compatible with them. Code will be publicly available at https://github.com/L-Hugh/RedundancyLens.
