What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph

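Authors: Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou

Categories: cs.CV, cs.AI

Published: 2025-01-04

Comments: 9 pages, 6 figures

💡 One-sentence summary

Proposes G-Prune to address visual token redundancy in multimodal large language models

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal large language models, visual token pruning, graph structure, computational efficiency, semantic similarity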

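📋 Key points

Existing multimodal large language models use many redundant visual tokens, which wastes computation and creates a performance bottleneck.
G-Prune prunes visual tokens over a graph, which refines token selection and reduces redundancy.
Applied to LLaVA-NeXT, G-Prune reduces FLOPs by 63.57% on VQA2.0 and TextVQA, with accuracy drops of only 0.95% and 2.34%, respectively.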

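📝 Abstract (translated from Chinese)

Recent multimodal large language models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, which leads to excessive computation and obvious visual redundancy. This paper investigates what kind of visual tokens MLLMs need and finds that both foreground and background tokens are critical, because examples vary in difficulty. Based on this observation, it proposes G-Prune, a graph-based, training-free visual token pruning method. G-Prune treats visual tokens as nodes and builds connections between them based on semantic similarity. Information flow is then propagated over the weighted links, and the tokens that are most important after the iterations are kept. Experiments show that G-Prune greatly reduces computation overhead while retaining high performance on both coarse- and fine-grained tasks.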

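🔬 Method details

Problem definition: The paper addresses visual token redundancy in multimodal large language models. Existing models use far more visual tokens than necessary, which leads to excessive computation.

Core idea: G-Prune treats visual tokens as nodes in a graph and connects them according to their semantic similarity. Token importance is then derived from the graph itself, so pruning requires no training.

Technical framework: The G-Prune pipeline has three main stages. First, the visual tokens become the nodes of a graph. Second, weighted connections between nodes are built from semantic similarity. Third, information flow is propagated over these links and the most important tokens are kept.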
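The three stages can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the summary does not specify how edges are built, how weights are normalized, or what the update rule is, so the sketch substitutes a simple PageRank-style propagation over cosine similarities. The function name `prune_tokens_graph` and the parameters `keep`, `iters`, and `threshold` are hypothetical.

```python
import numpy as np


def prune_tokens_graph(tokens, keep, iters=3, threshold=0.0):
    """Return (sorted indices of kept tokens, importance scores).

    Assumption-laden sketch: cosine-similarity graph plus a
    PageRank-style update, standing in for G-Prune's actual rules.
    """
    n = tokens.shape[0]

    # Stage 1: every token is a node; compare nodes by cosine similarity.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Stage 2: weighted edges. Keep similarities above the (hypothetical)
    # threshold and remove self-loops.
    weights = np.where(sim > threshold, sim, 0.0)
    np.fill_diagonal(weights, 0.0)

    # Row-normalize so each node spreads its importance over its neighbors.
    row_sums = weights.sum(axis=1, keepdims=True)
    transition = weights / np.clip(row_sums, 1e-12, None)

    # Stage 3: start from uniform importance and propagate along the links.
    score = np.full(n, 1.0 / n)
    for _ in range(iters):
        score = transition.T @ score

    # Keep the highest-scoring tokens, restored to their original order.
    top = np.argsort(-score, kind="stable")[:keep]
    return np.sort(top), score


# Usage on random stand-in features (16 tokens, 8 dimensions).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
kept, score = prune_tokens_graph(tokens, keep=6)
pruned = tokens[kept]
```

Because the scores come only from the graph, nothing in this sketch favors foreground over background regions. Sorting the kept indices preserves the original token order, which matters if positional information is attached to the tokens.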

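Key innovation: G-Prune prunes tokens without any training. It selects tokens through the graph structure, and this selection greatly reduces computation overhead.

Key design: The main parameters are the connection weights between nodes and the number of propagation iterations. Together they determine which important tokens are kept and how many redundant tokens are dropped. The summary gives no specific values for these parameters, and since the method is training-free, no training loss is involved.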

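📊 Experimental highlights

On LLaVA-NeXT, G-Prune reduces FLOPs by 63.57% on VQA2.0 and TextVQA, with accuracy drops of only 0.95% and 2.34%, respectively. This shows that it can cut computation overhead substantially while largely preserving performance.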
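To give a feel for how a smaller token count translates into fewer FLOPs, the sketch below uses a rough multiply-accumulate count for one decoder layer. Everything in it is an assumption rather than a figure from the paper: the cost formula is a standard back-of-the-envelope estimate, the layer sizes are those of a typical 7B LLaMA-style decoder, the token counts are made up, and pruning is assumed to happen before the language model so that every layer sees the shorter sequence. It is not meant to reproduce the reported 63.57%.

```python
def layer_macs(n_tokens, hidden=4096, ffn=11008):
    """Rough multiply-accumulate count for one decoder layer.

    Assumed breakdown: four hidden x hidden attention projections,
    two n x n attention products, and a gated FFN with three projections.
    """
    attn_proj = 4 * n_tokens * hidden * hidden
    attn_mix = 2 * n_tokens * n_tokens * hidden
    ffn_cost = 3 * n_tokens * hidden * ffn
    return attn_proj + attn_mix + ffn_cost


def reduction(n_before, n_after, **sizes):
    """Fraction of per-layer compute saved by shrinking the sequence."""
    return 1.0 - layer_macs(n_after, **sizes) / layer_macs(n_before, **sizes)


# Hypothetical sequence lengths: visual tokens plus a short text prompt.
n_before = 2000 + 64
n_after = 700 + 64
saved = reduction(n_before, n_after)
print(f"estimated per-layer saving: {saved:.1%}")
```

Under this estimate the projection and FFN terms scale linearly with the token count while the attention products scale quadratically, so the saving is slightly larger than the fraction of tokens removed.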

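🎯 Application scenarios

Potential application areas include computer vision, natural language processing, and human-computer interaction. By reducing the number of visual tokens a model must process, G-Prune can make multimodal models more efficient, which gives it broad practical value.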

📄 Abstract (original)

Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method towards training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes, and constructs their connections based on their semantic similarities. Afterwards, the information flow is propagated via weighted links, and the most important tokens after iterations are kept for MLLMs, which can be foreground or background. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT, and conduct extensive experiments on a set of benchmarks. The experiment results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce 63.57% FLOPs of LLaVA-NeXT on VQA2.0 and TextVQA with only 0.95% and 2.34% accuracy drops, respectively.
