Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
Authors: Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
Categories: cs.LG, cs.AI, cs.CL, cs.CV
Published: 2025-01-27
🔗 Code/Project: GITHUB
💡 One-Sentence Takeaway
Proposes Mixture-of-Mamba, which introduces modality-aware sparsity to improve multi-modal state-space models.
🎯 Matched Area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)
Keywords: state-space models, multi-modal pretraining, modality-aware sparsity, computational efficiency, deep learning
📋 Key Points
- Existing state-space models cannot effectively exploit modality-specific features in multi-modal pretraining, which limits their performance.
- This paper proposes Mixture-of-Mamba, which introduces modality-aware sparsity through modality-specific parameterization of the Mamba block, improving the performance of SSMs.
- Experiments show that Mixture-of-Mamba reaches comparable loss values at a fraction of the compute across multiple settings, demonstrating its efficiency.
📝 Abstract (Translated)
State Space Models (SSMs) have emerged as efficient alternatives for sequence modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. This paper proposes Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers, we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. Experimental results show that, across multiple multi-modal pretraining settings, Mixture-of-Mamba reaches the same loss values at earlier training steps with significantly reduced computational costs, demonstrating the potential of modality-aware sparsity as an effective design principle.
🔬 Method Details
Problem definition: This work targets the inability of existing state-space models to effectively exploit modality-specific features during multi-modal pretraining. Because existing methods process all modalities with shared parameters, they fail to fully exploit the characteristics of each modality, which limits model performance.
Core idea: Propose the Mixture-of-Mamba architecture, which introduces modality-aware sparsity through modality-specific parameterization, improving performance on multi-modal tasks while preserving the computational efficiency of SSMs.
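The core idea can be sketched as a modality-routed linear layer: one weight matrix per modality, with each token projected only by the weights of its own modality. This is a minimal NumPy sketch under our own naming (`ModalityAwareLinear` and its shapes are hypothetical illustrations, not the authors' implementation):

```python
import numpy as np

class ModalityAwareLinear:
    """One weight matrix per modality; each token is projected by the
    weights of its own modality. Hypothetical sketch, not the paper's code."""

    def __init__(self, d_in, d_out, n_modalities, seed=0):
        rng = np.random.default_rng(seed)
        # W[m] is the projection applied to tokens of modality m.
        self.W = rng.standard_normal((n_modalities, d_in, d_out)) / np.sqrt(d_in)

    def __call__(self, x, modality_ids):
        # x: (seq_len, d_in); modality_ids: (seq_len,) in [0, n_modalities)
        out = np.empty((x.shape[0], self.W.shape[2]))
        for m in range(self.W.shape[0]):
            mask = modality_ids == m
            # Sparse: each token activates exactly one modality's weights.
            out[mask] = x[mask] @ self.W[m]
        return out

# Interleaved text (0) and image (1) tokens, as in the Chameleon setting.
x = np.random.default_rng(1).standard_normal((6, 8))
ids = np.array([0, 1, 1, 0, 1, 0])
layer = ModalityAwareLinear(d_in=8, d_out=4, n_modalities=2)
y = layer(x, ids)
print(y.shape)  # (6, 4)
```

Because the routing is a hard split by modality, the per-token compute matches a single dense layer even though the parameter count scales with the number of modalities.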
Technical framework: Mixture-of-Mamba is composed of modality-specific Mamba blocks that process inputs from the different modalities. During training, the model is optimized with modality-appropriate loss functions to ensure the modalities reinforce one another.
Key innovation: The central technical contribution is modality-aware sparsity, which lets SSMs exploit modality-specific features far more effectively. Compared with traditional Transformer models, this design significantly improves both performance and computational efficiency on multi-modal tasks.
Key design: The model adopts modality-specific parameterization, with loss functions tailored to each modality. The experiments also examine how decoupling the projection components affects performance, finding that joint decoupling yields greater gains than modifying any single projection in isolation.
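To illustrate the decoupling idea, here is a hedged sketch of how modality-specific projections might sit inside a simplified block: the input and output projections are each routed by modality, while the sequential mixing step is shared (a cumulative sum stands in for the actual selective scan; all names and shapes are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_MOD = 8, 2  # model width and number of modalities (illustrative)

def per_modality(d_in, d_out):
    # One weight matrix per modality -> modality-specific parameterization.
    return rng.standard_normal((N_MOD, d_in, d_out)) / np.sqrt(d_in)

# Decoupled projections (loosely mirroring Mamba's input/output projections).
W_in = per_modality(D, 2 * D)
W_out = per_modality(D, D)

def routed(x, ids, W):
    # Apply each modality's weights only to that modality's tokens.
    out = np.empty((len(x), W.shape[2]))
    for m in range(N_MOD):
        out[ids == m] = x[ids == m] @ W[m]
    return out

def mom_block(x, ids):
    h = routed(x, ids, W_in)               # modality-specific input projection
    gate, h = np.split(h, 2, axis=-1)
    h = np.cumsum(h, axis=0)               # shared stand-in for the SSM scan
    h = h * (1.0 / (1.0 + np.exp(-gate)))  # sigmoid gating (placeholder)
    return routed(h, ids, W_out)           # modality-specific output projection

x = rng.standard_normal((6, D))
ids = np.array([0, 1, 1, 0, 1, 0])
print(mom_block(x, ids).shape)  # (6, 8)
```

Under this framing, the ablation in the paper corresponds to choosing which of the projections are per-modality versus shared; the reported result is that decoupling them jointly helps more than decoupling any one alone.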
📊 Experimental Highlights
In the Transfusion setting, Mixture-of-Mamba matches the image loss using only 34.76% of the training FLOPs at the 1.4B scale; in the Chameleon setting, it matches the image loss with 42.50% of the FLOPs and the text loss with 65.40%. In the three-modality setting, it matches the speech loss at 24.80% of the FLOPs, a substantial gain in computational efficiency.
🎯 Application Scenarios
Potential applications include multi-modal understanding and systems that combine natural language processing with computer vision. By improving the efficiency and performance of multi-modal models, Mixture-of-Mamba could benefit practical settings such as intelligent assistants, autonomous driving, and medical image analysis, advancing the development and deployment of these technologies.
📄 Abstract (Original)
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba