Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Authors: Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin

Category: cs.CL

Published: 2024-11-07 (updated: 2025-05-08)

Comments: Accepted to TMLR 2025; 48 pages

Journal: Transactions on Machine Learning Research (2025), ISSN: 2835-8856

💡 One-sentence summary

Proposes the Mixture-of-Transformers (MoT) architecture for efficient, scalable training of multi-modal foundation models.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multi-modal learning, Transformer, sparse models, model compression, image generation, speech recognition, compute efficiency

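📋 Key points

Training multi-modal LLMs requires very large compute budgets and datasets, and existing dense models scale poorly.
MoT decouples the non-embedding parameters by modality, giving modality-specific processing with global self-attention and lowering compute cost.
Across text, image, and speech settings, MoT matches the dense baseline with substantially fewer FLOPs, and a smaller MoT model outperforms a larger dense model on key image generation metrics.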

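📝 Abstract (translated from Chinese)

The development of large language models (LLMs) has expanded to multi-modal systems that process text, images, and speech within a unified framework. Training these models requires larger datasets and more compute than text-only LLMs. To address this scaling challenge, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal Transformer architecture that significantly reduces pretraining compute cost. MoT decouples the model's non-embedding parameters by modality (feed-forward networks, attention matrices, and layer normalization), enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text and image generation), MoT matches the dense baseline using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and images are trained with different objectives, a 7B MoT model matches the dense baseline's image-modality performance with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline on key image generation metrics. System profiling further highlights MoT's practical benefits: it reaches the dense baseline's image quality in 47.2% of the wall-clock time and its text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).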

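🔬 Method details

Problem definition: The paper targets the heavy compute demands of training multi-modal foundation models. Dense Transformers applied to multi-modal data need large compute budgets, which limits how far model size can be scaled and how efficiently models can be trained.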

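Core idea: Introduce sparsity through the Mixture-of-Transformers (MoT) architecture, which decouples processing by modality. MoT separates the model's non-embedding parameters (feed-forward networks, attention matrices, and layer normalization) by modality, so each token is processed only by the weights of its own modality. The saving is in the pretraining compute needed to reach a given quality; the total parameter count is not reduced, since each modality keeps its own copy of these weights.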

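Framework: The core of MoT is partitioning the Transformer's non-embedding parameters by modality. The overall architecture is still a Transformer, but each layer contains several modality-specific sub-modules. Input first passes through the embedding layer and then through the MoT layers. Inside a MoT layer, each token is handled by the sub-modules of its own modality, while self-attention is computed globally over the full mixed-modality sequence so that modalities can interact. In the autoregressive setting, the outputs finally pass through a linear layer and a softmax to produce predictions.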

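Key innovation: Modality decoupling combined with sparsity. By separating the non-embedding parameters by modality, MoT processes each modality with its own weights and lowers compute cost. Compared with a conventional dense Transformer, MoT reaches comparable quality with fewer training FLOPs and scales better.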
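The sense in which the architecture is sparse can be illustrated with a rough weight count. The sketch below is only an approximation under stated assumptions: it uses the common 4·d² (attention) plus 8·d² (feed-forward, 4x expansion) estimate per layer, ignores biases and normalization weights, and the hidden size 4096 and the three modalities are illustrative values, not figures taken from the paper.

```python
def layer_params(d_model, ffn_mult=4):
    """Approximate non-embedding weights in one Transformer layer.

    Assumption: 4 attention projections (Q, K, V, O) of size d x d and a
    two-matrix feed-forward block with expansion factor ffn_mult.
    """
    attention = 4 * d_model * d_model
    ffn = 2 * ffn_mult * d_model * d_model
    return attention + ffn


def mot_layer_params(d_model, n_modalities, ffn_mult=4):
    """Weights stored vs. weights touched by one token in a MoT-style layer."""
    per_modality = layer_params(d_model, ffn_mult)
    return {
        "stored": n_modalities * per_modality,  # one copy per modality
        "used_per_token": per_modality,         # a token only uses its own copy
    }


dense = layer_params(4096)                      # illustrative hidden size
mot = mot_layer_params(4096, n_modalities=3)    # e.g. text, image, speech
print("dense layer weights:      ", dense)
print("MoT weights stored:       ", mot["stored"])
print("MoT weights used per token:", mot["used_per_token"])
```

Under these assumptions each token touches the same number of weights as in the dense layer, while the layer stores one copy per modality; that is the sparsity the summary refers to.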

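Key design: 1) Modality-specific feed-forward networks: each modality has its own feed-forward network for processing that modality's features. 2) Modality-specific attention matrices: each modality has its own attention projection weights, which are applied to the tokens of that modality. 3) Global self-attention: attention itself is computed over the full input sequence, which is how information flows between modalities. 4) Modality-specific layer normalization: normalization parameters are also kept separate per modality. The choice of weights follows from each token's modality.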
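The design points above can be sketched as a single toy layer in plain Python. This is a minimal illustration, not the authors' implementation: it assumes a single attention head, no causal mask (an autoregressive model would add one), RMS-style normalization, a ReLU feed-forward block, and random weights. All names (`init_params`, `mot_layer`, the `"text"`/`"image"` tags, the toy sizes) are hypothetical.

```python
import math
import random

MODALITIES = ("text", "image")  # hypothetical modality tags
D, D_FF = 4, 8                  # toy sizes chosen for illustration


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def vec_mat(v, m):
    """Row vector (length r) times matrix (r x c) -> vector (length c)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def init_params(seed=0):
    """One independent set of non-embedding weights per modality."""
    rng = random.Random(seed)
    return {
        m: {
            "wq": rand_matrix(D, D, rng), "wk": rand_matrix(D, D, rng),
            "wv": rand_matrix(D, D, rng), "wo": rand_matrix(D, D, rng),
            "w1": rand_matrix(D, D_FF, rng), "w2": rand_matrix(D_FF, D, rng),
            "gain_attn": [1.0] * D, "gain_ffn": [1.0] * D,  # per-modality norm gains
        }
        for m in MODALITIES
    }


def rms_norm(x, gain):
    scale = math.sqrt(sum(a * a for a in x) / len(x) + 1e-6)
    return [g * a / scale for a, g in zip(x, gain)]


def mot_layer(xs, mods, params):
    """xs: list of token vectors; mods: modality tag of each token."""
    # 1) Modality-specific norm and Q/K/V projections, picked by token tag.
    normed = [rms_norm(x, params[m]["gain_attn"]) for x, m in zip(xs, mods)]
    q = [vec_mat(h, params[m]["wq"]) for h, m in zip(normed, mods)]
    k = [vec_mat(h, params[m]["wk"]) for h, m in zip(normed, mods)]
    v = [vec_mat(h, params[m]["wv"]) for h, m in zip(normed, mods)]

    out = []
    for i, m in enumerate(mods):
        p = params[m]
        # 2) Global self-attention: token i attends to all tokens of any modality.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(D) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        ctx = [sum(w / total * vj[d] for w, vj in zip(weights, v)) for d in range(D)]
        # 3) Modality-specific output projection, plus residual connection.
        h = [a + b for a, b in zip(xs[i], vec_mat(ctx, p["wo"]))]
        # 4) Modality-specific feed-forward network, plus residual connection.
        hidden = [max(0.0, u) for u in vec_mat(rms_norm(h, p["gain_ffn"]), p["w1"])]
        out.append([a + b for a, b in zip(h, vec_mat(hidden, p["w2"]))])
    return out


params = init_params(seed=0)
tokens = [[0.1 * (i + j) for j in range(D)] for i in range(3)]
tags = ["text", "image", "text"]  # an interleaved sequence
outputs = mot_layer(tokens, tags, params)
print(len(outputs), len(outputs[0]))
```

The only place tokens of different modalities meet is the attention step; every weight matrix is looked up by the token's tag, so no learned router is needed in this sketch.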


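📊 Experimental highlights

In the Chameleon 7B setting, MoT matches the dense baseline using only 55.8% of the FLOPs. In the setting that also includes speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, a 7B MoT model matches the dense baseline's image-modality performance with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline on key image generation metrics. The gains also hold in wall-clock terms: MoT reaches the dense baseline's image quality in 47.2% of the time and its text quality in 75.6% of the time.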
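To read these percentages as reduction factors, the reciprocal is enough. The snippet below is simple arithmetic on the figures reported in the abstract; the dictionary labels and the helper name `reduction_factor` are made up for illustration.

```python
# Cost of MoT as a percentage of the dense baseline, as reported in the abstract.
reported_percent = {
    "Chameleon 7B, text+image (FLOPs)": 55.8,
    "With speech, speech performance (FLOPs)": 37.2,
    "Image quality (wall-clock time)": 47.2,
    "Text quality (wall-clock time)": 75.6,
}


def reduction_factor(percent_of_dense):
    """How many times cheaper MoT is than the dense baseline."""
    return 100.0 / percent_of_dense


for name, pct in reported_percent.items():
    saved = 100.0 - pct
    print(f"{name}: {pct}% of dense, {reduction_factor(pct):.2f}x cheaper, {saved:.1f}% saved")
```

Note that the FLOPs and wall-clock figures measure different things, so their factors need not agree with each other.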

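🎯 Application scenarios

MoT can serve as the backbone for efficient multi-modal foundation models in areas such as image generation, video understanding, speech recognition, and multi-modal dialogue. By lowering the compute cost of multi-modal training, it makes larger and more complex multi-modal models practical to train.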

📄 Abstract (original)

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
