Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

Authors: Baihui Liu, Kaiyuan Tian, Wei Wang, Zhaoning Zhang, Linbo Qiao, Dongsheng Li

Categories: cs.LG, cs.AI, cs.CL

Published: 2026-04-09

Note: ACL 2026 main

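💡 One-Sentence Takeaway

Proposes Alloc-MoE to reduce the inference latency caused by the large number of expert activations in Mixture-of-Experts models.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: Mixture-of-Experts, activation budget, inference optimization, dynamic programming, performance improvement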

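📋 Key Points

Existing Mixture-of-Experts architectures incur significant inference latency because of the large number of expert activations, and the problem is most acute in resource-constrained environments.
The paper proposes the Alloc-MoE framework, which introduces the concept of an activation budget and optimizes the allocation of expert activations at both the layer and token levels to limit performance degradation.
Experiments show that Alloc-MoE achieves 1.15× prefill and 1.34× decode speedups on DeepSeek-V2-Lite at half of the original budget while maintaining model performance under the budget constraint.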

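📝 Abstract (Translated)

Mixture-of-Experts (MoE) has become a mainstream architecture for scaling large language models thanks to its sparse activation mechanism. However, the large number of expert activations creates a significant latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing methods for reducing expert activations can severely degrade model performance. This paper introduces the concept of an activation budget and proposes Alloc-MoE, a framework that coordinates budget allocation at the layer and token levels to minimize performance degradation. At the layer level, Alloc-L uses sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations. At the token level, Alloc-T dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments show that Alloc-MoE maintains model performance under a constrained activation budget; notably, it achieves 1.15× prefill and 1.34× decode speedups on DeepSeek-V2-Lite.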

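🔬 Method Details

Problem definition: The paper targets the latency bottleneck that the large number of expert activations creates during inference in Mixture-of-Experts architectures. Existing methods that reduce activations can significantly degrade model performance.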

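Core idea: The paper introduces the activation budget, a constraint on the number of expert activations, and uses the Alloc-MoE framework to optimize how that budget is allocated at the layer and token levels so as to minimize performance loss.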
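One way to make the budget constraint concrete is to write it as a constrained optimization problem. The notation below is an assumption introduced for illustration and is not taken from the paper: $L$ is the number of MoE layers, $k_l$ the number of experts activated per token in layer $l$, $K$ the model's original per-layer activation count, $B$ the total activation budget, and $\Delta_l(k_l)$ an estimate of the degradation incurred when layer $l$ runs with $k_l$ experts. The sketch also assumes that degradation adds up across layers, which the summary does not state.

```latex
\min_{k_1,\dots,k_L} \; \sum_{l=1}^{L} \Delta_l(k_l)
\quad \text{s.t.} \quad \sum_{l=1}^{L} k_l \le B, \qquad 1 \le k_l \le K .
```

Under this reading, a uniform reduction sets every $k_l$ to the same value, while a budget-aware allocation is free to give sensitive layers more activations and robust layers fewer.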

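Technical framework: Alloc-MoE consists of two main modules, Alloc-L and Alloc-T. Alloc-L works at the layer level, using sensitivity profiling and dynamic programming to determine the optimal allocation of activations. Alloc-T works at the token level, dynamically adjusting the allocation according to routing scores.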
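The layer-level step can be illustrated with a small dynamic program. This is a minimal sketch under stated assumptions, not the paper's Alloc-L implementation: the function name `allocate_layer_budget`, the table `loss` (where `loss[l][k-1]` is a profiled degradation estimate for layer `l` running with `k` experts), and the assumption that degradation is additive across layers are all hypothetical.

```python
from typing import List, Tuple


def allocate_layer_budget(loss: List[List[float]], budget: int) -> Tuple[float, List[int]]:
    """Pick a per-layer expert count that minimizes the summed degradation.

    loss[l][k-1] is a (hypothetical) profiled degradation for layer l with k experts.
    Returns (total estimated degradation, per-layer counts) with sum(counts) <= budget.
    """
    num_layers = len(loss)
    if budget < num_layers:
        raise ValueError("budget must allow at least one expert per layer")

    inf = float("inf")
    # best[b]: minimal degradation of the layers seen so far using exactly b activations.
    best = [inf] * (budget + 1)
    best[0] = 0.0
    choices: List[List[int]] = []  # choices[l][b]: k picked for layer l to reach total b

    for layer_loss in loss:
        new_best = [inf] * (budget + 1)
        picked = [0] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == inf:
                continue
            for k, cost in enumerate(layer_loss, start=1):
                total = used + k
                if total > budget:
                    break  # larger k only uses more budget
                candidate = best[used] + cost
                if candidate < new_best[total]:
                    new_best[total] = candidate
                    picked[total] = k
        best = new_best
        choices.append(picked)

    # Choose the cheapest reachable end state, then walk back through the choices.
    end = min(range(budget + 1), key=lambda b: best[b])
    allocation: List[int] = []
    b = end
    for picked in reversed(choices):
        k = picked[b]
        allocation.append(k)
        b -= k
    allocation.reverse()
    return best[end], allocation


# Toy sensitivity table: layer 0 is sensitive, layer 1 is robust.
toy_loss = [[0.9, 0.4, 0.1], [0.2, 0.1, 0.05], [0.5, 0.3, 0.25]]
_, toy_allocation = allocate_layer_budget(toy_loss, budget=6)
print(toy_allocation)  # [3, 1, 2]
```

In this toy table the uniform allocation `[2, 2, 2]` uses the same budget but has a higher summed estimate (0.8 versus 0.6), which is the kind of gap a sensitivity-aware allocation is meant to exploit.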

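Key innovation: Alloc-MoE's core novelty is introducing the activation budget and coordinating allocation at both the layer and token levels, which sets it apart from existing methods that optimize at only one of those levels.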

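Key design: In Alloc-L, sensitivity profiling estimates how much each layer affects model performance, and dynamic programming then finds the optimal allocation of activations. In Alloc-T, activations are dynamically redistributed according to routing scores computed at inference time, so budget allocation improves without added latency.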
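The summary does not spell out Alloc-T's redistribution rule, so the following is only one plausible reading, sketched under explicit assumptions: the tokens in a batch share a pooled budget of `k_avg` activations per token at a given layer, every token keeps at least `k_min` experts, and the remaining activations go to the highest routing scores across all tokens. The function name `redistribute_token_budget` and both parameters are hypothetical.

```python
from typing import List


def redistribute_token_budget(
    scores: List[List[float]], k_avg: int, k_min: int = 1
) -> List[List[int]]:
    """Share a pooled activation budget across tokens by routing score.

    scores[t][e] is the routing score of expert e for token t. The total number
    of activations never exceeds len(scores) * k_avg (assumed pooling rule).
    """
    if k_min > k_avg:
        raise ValueError("k_min cannot exceed the average per-token budget")

    total_budget = len(scores) * k_avg
    selected: List[List[int]] = []
    leftovers = []  # (score, token index, expert index) for experts not yet chosen

    for t, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        selected.append(ranked[:k_min])  # guaranteed minimum per token
        leftovers.extend((row[e], t, e) for e in ranked[k_min:])

    # Spend whatever budget is left on the globally highest remaining scores.
    remaining = total_budget - sum(len(chosen) for chosen in selected)
    leftovers.sort(key=lambda item: item[0], reverse=True)
    for _, t, e in leftovers[:remaining]:
        selected[t].append(e)
    return selected


# Token 0 is confidently routed; token 1 spreads its score over several experts.
toy_scores = [[0.70, 0.20, 0.05, 0.05], [0.30, 0.28, 0.22, 0.20]]
print(redistribute_token_budget(toy_scores, k_avg=2))  # [[0], [0, 1, 2]]
```

Because the total activation count stays at `len(scores) * k_avg`, the compute per batch matches a plain top-`k_avg` router in this sketch; the confidently routed token gives up an activation that the uncertain token can use.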

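📊 Experimental Highlights

Experiments show that Alloc-MoE achieves a 1.15× prefill speedup and a 1.34× decode speedup on DeepSeek-V2-Lite at half of the original budget while maintaining model performance under the activation budget constraint, which indicates that the approach is effective in practice.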

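🎯 Application Scenarios

Potential applications include inference optimization for large language models, particularly in resource-constrained environments such as mobile devices and edge computing. By managing expert activations effectively, Alloc-MoE can speed up inference while maintaining model performance, which makes it practically relevant for such deployments.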

📄 Abstract (Original)

Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to their sparse activation mechanism. However, the substantial number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations potentially lead to severe model performance degradation. In this work, we introduce the concept of \emph{activation budget} as a constraint on the number of expert activations and propose Alloc-MoE, a unified framework that optimizes budget allocation coordinately at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. Especially, Alloc-MoE achieves $1.15\times$ prefill and $1.34\times$ decode speedups on DeepSeek-V2-Lite at half of the original budget.
