Bottleneck Tokens for Unified Multimodal Retrieval
Authors: Siyu Sun, Jing Ren, Zhaohe Liao, Dongxiao Mao, Xiangyuan Ren, Yiyi Zhang, Haohua Zhao, Weixiong Lin, Jiang Shaohua, Liqing Zhang, Yuchao Zheng
Categories: cs.LG, cs.AI
Published: 2026-04-13
💡 One-Line Takeaway
Proposes Bottleneck Tokens (BToks) for unified multimodal retrieval, addressing implicit information aggregation and the lack of token-level compression guidance.
🎯 Matched domain: Pillar 9: Embodied Foundation Models
Keywords: multimodal retrieval, information bottleneck, Bottleneck Tokens, generative learning, semantic compression
📋 Key Points
- Existing methods rely on implicit pooling for multimodal information aggregation, creating an information bottleneck and offering no token-level guidance for compression.
- Bottleneck Tokens (BToks) are introduced as an explicit pooling mechanism, with Generative Information Condensation providing token-level supervision.
- Experiments show state-of-the-art results on MMEB-V2, with especially large gains on semantically demanding tasks.
🔬 Method Details
Problem definition: Existing multimodal retrieval methods built on decoder-only MLLMs rely mainly on implicit pooling (e.g., reusing the hidden state of a standard vocabulary token as the sequence-level representation), a mechanism never designed for information aggregation; moreover, contrastive fine-tuning specifies what the embedding should match but gives no token-level guidance on how information should be compressed into it.
Core idea: Introduce explicit, learnable Bottleneck Tokens (BToks) as an information bottleneck, forcing the model to compress all relevant information into these tokens. In parallel, Generative Information Condensation supplies token-level supervision that teaches the model how to compress information effectively.
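The generative supervision described above can be sketched as a next-token-prediction cross-entropy restricted to target-sequence positions. This is a minimal numpy illustration, not the paper's implementation; the `target_start` index and mean reduction are assumptions.

```python
import numpy as np

def ntp_loss(logits: np.ndarray, targets: np.ndarray, target_start: int) -> float:
    """Next-token-prediction loss over target positions only.

    logits: (seq_len, vocab) scores; targets: gold next token per position.
    Positions before `target_start` (query + BToks) carry no loss, so the
    only training signal is reconstruction of the target sequence, which the
    Condensation Mask forces to flow through the BToks.
    """
    logits = logits[target_start:]
    targets = targets[target_start:]
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(targets)), targets].mean())
```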
Technical framework: The overall framework is a decoder-only MLLM with a small set of learnable BToks appended to the input sequence. During training, next-token prediction is the main objective, and a Condensation Mask blocks target tokens from attending directly to query tokens, forcing all predictive signal through the BToks. At inference, only the input and the BToks are processed in a single forward pass, and the BTok representations are extracted for retrieval.
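The Condensation Mask can be sketched as a modified causal attention mask over the layout [query | BToks | target]. This is an illustrative assumption of how such a mask could be built, not the paper's code; the segment ordering is assumed.

```python
import numpy as np

def condensation_mask(n_query: int, n_btok: int, n_target: int) -> np.ndarray:
    """Boolean attention mask (True = position may attend).

    Start from a standard causal mask, then sever the direct attention
    path from target tokens to query tokens, so any information about the
    query must reach the targets via the BToks.
    """
    n = n_query + n_btok + n_target
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    tgt_start = n_query + n_btok
    mask[tgt_start:, :n_query] = False  # targets cannot see the query
    return mask
```

BToks still attend to the query under this mask, which is what lets them act as the sole conduit of query information.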
Key innovation: The central contribution is introducing BToks as an explicit, learnable information bottleneck and pairing them with Generative Information Condensation, yielding token-level supervision for semantic compression. This contrasts sharply with existing methods, which rely on implicit pooling and lack token-level guidance.
Key design choices: The number of BToks is a key hyperparameter and should be tuned to task complexity. The Condensation Mask guarantees that all information must pass through the BToks. The loss is dominated by the next-token-prediction term, which guides the BToks toward effective semantic representations.
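At inference time, retrieval reduces to extracting the BTok hidden states and ranking candidates by similarity. A minimal sketch, assuming the BToks occupy the last `n_btok` positions and are mean-pooled into one L2-normalised vector; the paper specifies only that BTok representations are extracted in a single forward pass, so the pooling choice here is an assumption.

```python
import numpy as np

def btok_embedding(hidden_states: np.ndarray, n_btok: int) -> np.ndarray:
    """Pool the BTok positions (assumed last n_btok rows) into one
    L2-normalised retrieval vector. hidden_states: (seq_len, dim)."""
    btoks = hidden_states[-n_btok:]          # (n_btok, dim)
    vec = btoks.mean(axis=0)                 # mean pooling (assumption)
    return vec / (np.linalg.norm(vec) + 1e-12)

def retrieve(query_vec: np.ndarray, candidate_vecs: np.ndarray) -> np.ndarray:
    """Rank candidate indices by cosine similarity (inputs pre-normalised)."""
    return np.argsort(-(candidate_vecs @ query_vec))
```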
📊 Experimental Highlights
On the MMEB-V2 benchmark (78 datasets, 3 modalities, 9 meta-tasks), the method achieves state-of-the-art results among 2B-scale models, with an Overall score of 59.0 (+3.6 over VLM2Vec-V2). Gains are largest on semantically demanding tasks, e.g., +12.6 on Video-QA.
🎯 Application Scenarios
The method applies to a wide range of multimodal retrieval scenarios, such as image/video search, cross-modal question answering, and multimodal content recommendation. By aggregating and compressing multimodal information more effectively, it can improve retrieval accuracy and efficiency, yielding a better search experience. It could further be extended to additional modalities and more complex retrieval tasks.
📄 Abstract (Original)
Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., ) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).