Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU

Authors: Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun

Category: cs.LG

Published: 2025-06-06 (updated: 2025-08-19)

💡 One-Sentence Takeaway

Proposes STOF, a framework that optimizes sparse Transformer performance on GPU.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: sparse Transformer, operator fusion, GPU acceleration, multi-head attention, performance optimization, flexible masking, large language models

📋 Key Points

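Little prior work targets the performance optimization of sparse Transformers, and rule-based mechanisms miss the fusion opportunities among mixed-type operators.
The paper proposes STOF, a framework that optimizes sparse Transformers through a unified storage format and kernel implementation, flexible masking, and operator fusion.
Compared with the state-of-the-art work, STOF achieves speedups of up to 1.7x in multi-head attention (MHA) computation and up to 1.5x in end-to-end inference.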

📝 Abstract (translated from Chinese)

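Large language models (LLMs) have attracted wide attention for their strong understanding capabilities. As the Transformer is the core component of LLMs, accelerating it through parallelization has become a popular research topic. Mask layers introduce sparsity that reduces computation, yet existing work rarely addresses the performance optimization of sparse Transformers. In addition, rule-based mechanisms overlook the fusion opportunities of mixed-type operators and fail to adapt to different sequence lengths. To address these problems, the paper proposes STOF, a framework that optimizes sparse Transformers on GPU through flexible masking and operator fusion. Experiments show that, compared with the state-of-the-art work, STOF achieves maximum speedups of 1.7x in multi-head attention computation and 1.5x in end-to-end inference.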

🔬 Method Details

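Problem definition: The paper addresses the lack of performance optimization for sparse Transformers. In particular, existing rule-based methods do not exploit the fusion opportunities of mixed-type operators and adapt poorly to different sequence lengths.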

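Core idea: STOF improves the computational efficiency of sparse Transformers through flexible masking and operator fusion. A unified storage format and kernel implementation improve the performance of multi-head attention.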

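Technical framework: STOF consists of several modules. First, it unifies the storage format and kernel implementation for multi-head attention. Second, it maps fusion schemes to compilation templates. Finally, a two-stage search engine determines the optimal parameter setting.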
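To make the first module more concrete, the sketch below shows one way a single storage format can serve different mask types. It is not STOF's actual format or kernel: the CSR-like block layout, the tile size `BLOCK`, and all function names are assumptions chosen for illustration, and NumPy stands in for a GPU kernel. A causal mask and a sliding-window mask are compressed into the same block lists, and one attention routine then computes only the stored blocks.

```python
import numpy as np

BLOCK = 4  # hypothetical tile size, not taken from the paper


def causal_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))


def sliding_window_mask(n, w):
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w


def to_block_format(mask, block=BLOCK):
    """Compress a dense boolean mask into CSR-like lists of non-empty blocks."""
    n = mask.shape[0]
    assert n % block == 0, "sketch assumes the length is a multiple of the tile size"
    row_ptr, col_idx, tiles = [0], [], []
    for bi in range(n // block):
        for bj in range(n // block):
            tile = mask[bi * block:(bi + 1) * block, bj * block:(bj + 1) * block]
            if tile.any():  # fully masked blocks are never stored
                col_idx.append(bj)
                tiles.append(tile.copy())
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, tiles


def block_sparse_attention(q, k, v, fmt, block=BLOCK):
    """Attention that touches only stored blocks.

    Assumes every query row may attend to at least one key.
    """
    row_ptr, col_idx, tiles = fmt
    n, d = q.shape
    out = np.zeros_like(v)
    for bi in range(n // block):
        rows = slice(bi * block, (bi + 1) * block)
        scores, values = [], []
        for p in range(row_ptr[bi], row_ptr[bi + 1]):
            cols = slice(col_idx[p] * block, (col_idx[p] + 1) * block)
            s = q[rows] @ k[cols].T / np.sqrt(d)
            scores.append(np.where(tiles[p], s, -np.inf))  # element mask inside the block
            values.append(v[cols])
        s = np.concatenate(scores, axis=1)
        s = s - s.max(axis=1, keepdims=True)  # numerically stable softmax
        prob = np.exp(s)
        prob /= prob.sum(axis=1, keepdims=True)
        out[rows] = prob @ np.concatenate(values, axis=0)
    return out


def dense_masked_attention(q, k, v, mask):
    """Reference: compute all scores, then mask."""
    s = np.where(mask, q @ k.T / np.sqrt(q.shape[1]), -np.inf)
    s = s - s.max(axis=1, keepdims=True)
    prob = np.exp(s)
    prob /= prob.sum(axis=1, keepdims=True)
    return prob @ v


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
for mask in (causal_mask(16), sliding_window_mask(16, 2)):
    fmt = to_block_format(mask)
    assert np.allclose(block_sparse_attention(q, k, v, fmt),
                       dense_masked_attention(q, k, v, mask))
```

Both masks go through the same `to_block_format` and the same attention routine, which is the point of a unified format: supporting a new mask pattern only requires producing the block lists, not writing a new kernel.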

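Key innovation: STOF's main contribution is its flexible masking and operator fusion strategy. Unlike traditional rule-based approaches, it can adapt to different sequence lengths and computational demands.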
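As a rough illustration of what fusing operators of different types can mean, the sketch below fuses a matrix multiplication with the bias-add and ReLU that follow it. This reading of "mixed-type" as a compute-bound operator combined with memory-bound ones is an assumption, as are the function names and the `tile_rows` parameter; the loop only mimics in NumPy what a fused GPU kernel would do in registers or shared memory.

```python
import numpy as np


def unfused_linear_bias_relu(x, w, b):
    y = x @ w                  # pass 1: GEMM, full intermediate written out
    y = y + b                  # pass 2: reads and rewrites the whole intermediate
    return np.maximum(y, 0.0)  # pass 3: another full read and write


def fused_linear_bias_relu(x, w, b, tile_rows=8):
    out = np.empty((x.shape[0], w.shape[1]), dtype=x.dtype)
    for r in range(0, x.shape[0], tile_rows):
        acc = x[r:r + tile_rows] @ w  # tile accumulator (on-chip on a GPU)
        # Epilogue: bias and activation are applied before the single write-back.
        out[r:r + tile_rows] = np.maximum(acc + b, 0.0)
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((20, 16))
w = rng.standard_normal((16, 12))
b = rng.standard_normal(12)
assert np.allclose(unfused_linear_bias_relu(x, w, b), fused_linear_bias_relu(x, w, b))
```

Whether such a fusion pays off depends on tile sizes and input shapes, which is one plausible reason a fixed rule struggles across sequence lengths and a search over parameters is used instead.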

Key design: STOF uses a two-stage search engine to tune parameter configurations, aiming for the best performance in each scenario.
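The summary does not say what the two stages are. One common pattern, sketched below purely as an assumption, is to rank all candidates with a cheap analytical estimate first and then measure only a shortlist. The parameter names (`tile_m`, `tile_n`, `stages`), the cost model, and the synthetic `measure` stand-in are all hypothetical; a real engine would time compiled kernels on the GPU.

```python
from itertools import product


def ceil_div(a, b):
    return -(-a // b)


def candidate_configs():
    # Hypothetical search space, not the paper's.
    return [dict(tile_m=m, tile_n=n, stages=s)
            for m, n, s in product([16, 32, 64, 128], [16, 32, 64, 128], [1, 2, 3])]


def estimated_cost(cfg, seq_len):
    """Stage 1 model (assumed): padded work plus a per-tile launch overhead."""
    tiles = ceil_div(seq_len, cfg["tile_m"]) * ceil_div(seq_len, cfg["tile_n"])
    padded_work = tiles * cfg["tile_m"] * cfg["tile_n"]
    return padded_work + 500 * tiles / cfg["stages"]


def synthetic_measure(cfg, seq_len):
    """Stand-in for a real timing run; adds a penalty the model does not see."""
    pressure = 0.02 * cfg["tile_m"] * cfg["tile_n"] * cfg["stages"]
    return estimated_cost(cfg, seq_len) + pressure


def two_stage_search(seq_len, measure, keep=5):
    # Stage 1: prune the full space with the cheap estimate.
    shortlist = sorted(candidate_configs(),
                       key=lambda c: estimated_cost(c, seq_len))[:keep]
    # Stage 2: evaluate only the survivors and keep the fastest.
    return min(shortlist, key=lambda c: measure(c, seq_len))


for seq_len in (128, 300, 2048):
    print(seq_len, two_stage_search(seq_len, synthetic_measure))
```

Running the search per sequence length lets the chosen configuration change with the input shape, which a single fixed rule cannot do.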

📊 Experimental Highlights

Compared with the state-of-the-art work, STOF achieves maximum speedups of 1.7x in multi-head attention computation and 1.5x in end-to-end inference.

🎯 Application Scenarios

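Potential application areas include natural language processing, machine translation, and dialogue systems, where the approach could improve the inference speed and computational efficiency of large language models.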

📄 Abstract (original)

Large language models are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. Moreover, rule-based mechanisms ignore the fusion opportunities of mixed-type operators and fail to adapt to various sequence lengths. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer via flexible masking and operator fusion on GPU. We firstly unify the storage format and kernel implementation for the multi-head attention. Then, we map fusion schemes to compilation templates and determine the optimal parameter setting through a two-stage search engine. The experimental results show that compared to the state-of-the-art work, STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference.
