Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

Authors: Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long

Categories: cs.AI, cs.DC, cs.LG

Published: 2024-12-24

💡 One-sentence takeaway

XY-Serve: tackles the dynamicity of production LLM serving systems through hybrid scheduling and efficient meta-kernel optimizations.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: LLM serving, dynamicity optimization, meta-kernel, Ascend NPU, hybrid scheduling, GEMM optimization, attention optimization

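📋 Key points

Production LLM serving must cope with workload variability caused by dynamic input and output lengths, and existing optimization techniques struggle to stay efficient on AI accelerators.
XY-Serve smooths the workload by decomposing computation into hardware-friendly, fine-grained meta primitives, with dedicated optimization schemes for attention and GEMM.
Experiments show up to 89% higher end-to-end throughput on Ascend NPUs, and the GEMM and attention kernels outperform existing ones.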

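📝 Abstract (translated from Chinese)

Meeting the growing demand for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, the dynamic and unpredictable input and output lengths of LLMs, combined with these optimizations, aggravate workload variability and make it hard to sustain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, the authors introduce XY-Serve, a versatile, Ascend-native, end-to-end production LLM serving system. Its core idea is an abstraction mechanism that smooths workload variability by decomposing computation into unified, hardware-friendly, fine-grained meta primitives. For attention, they propose a meta-kernel that computes the basic matmul-softmax-matmul pattern with architecture-aware tile sizes. For GEMM, they introduce a virtual padding scheme that adapts to dynamic shape changes while using efficient GEMM primitives with assorted fixed tile sizes. XY-Serve coexists with vLLM. Experiments show up to 89% higher end-to-end throughput on Ascend NPUs compared with current publicly available baselines. In addition, relative to existing libraries, the approach outperforms existing GEMM kernels (14.6% faster on average) and attention kernels (21.5% faster on average). Although the work is Ascend native, the authors believe the approach can also be applied readily to SIMT architectures.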

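🔬 Method details

Problem definition: The paper addresses the dynamicity that LLM serving systems face in production. Input and output lengths vary from request to request, which makes the compute load unpredictable. Existing optimization methods handle this poorly, especially on tile-based DSA architectures, where it leads to low resource utilization and performance bottlenecks.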

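Core idea: An abstraction mechanism smooths out workload variability. Complex computations are decomposed into unified, hardware-friendly, fine-grained meta primitives. When input and output lengths change, the system adapts by adjusting how the meta primitives are combined, which improves hardware utilization.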
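To make the decomposition idea concrete, here is a minimal sketch that cuts requests of arbitrary length into fixed-size tile tasks. The `TileTask` record, the `decompose` helper, and the tile size of 128 are all hypothetical names and values chosen for illustration; the summary does not describe how XY-Serve actually represents its meta primitives.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TileTask:
    """Hypothetical descriptor for one fixed-size unit of work."""
    request_id: int  # index of the request the tile belongs to
    start: int       # first token position covered by the tile
    valid: int       # number of real tokens in the tile (<= tile size)

def decompose(lengths, tile=128):
    """Split each request's token range into tiles of a single fixed size.

    Every task has the same shape from the kernel's point of view; only the
    number of tasks and the `valid` count of the last tile per request vary.
    """
    tasks = []
    for request_id, n in enumerate(lengths):
        for start in range(0, n, tile):
            tasks.append(TileTask(request_id, start, min(tile, n - start)))
    return tasks

# Three requests with very different lengths map onto uniform tiles.
tasks = decompose([300, 1, 129], tile=128)
for task in tasks:
    print(task)
```

With this representation a change in sequence length changes the length of the task list, not the shape of any individual task.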

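Framework: XY-Serve is an end-to-end LLM serving system that is compatible with vLLM. Its main components are: 1) an abstraction layer that decomposes computation into meta primitives; 2) optimized meta-kernels for attention and GEMM; 3) a hybrid scheduler that coordinates computation across the Prefill, Decode, and Verify stages. The overall flow is: receive request -> Prefill (process the prompt) -> Decode (generate tokens) -> Verify (validate results) -> return result.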
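One way to picture hybrid Prefill/Decode/Verify scheduling is a batch builder that fills a per-step token budget from requests in all three stages. The sketch below is an assumption-heavy illustration, not the scheduler from the paper: the `build_batch` function, the priority order, the token budget, and the reading of Verify as a multi-token check (as in speculative decoding) are all hypothetical and not stated in the summary.

```python
def build_batch(requests, token_budget):
    """Pick work for one step from requests in mixed stages.

    requests: list of dicts with keys 'id', 'stage', 'tokens', where
    'tokens' is the number of tokens the request wants processed this step.
    Assumed policy: decode and verify steps go first and are never split;
    prefill is chunked to fill whatever budget is left.
    """
    priority = {"decode": 0, "verify": 1, "prefill": 2}
    batch, remaining = [], token_budget
    # sorted() is stable, so arrival order is kept within each stage.
    for req in sorted(requests, key=lambda r: priority[r["stage"]]):
        if remaining == 0:
            break
        take = min(req["tokens"], remaining)
        if req["stage"] != "prefill" and take < req["tokens"]:
            continue  # does not fit whole; leave it for a later step
        batch.append((req["id"], req["stage"], take))
        remaining -= take
    return batch

requests = [
    {"id": "a", "stage": "prefill", "tokens": 500},
    {"id": "b", "stage": "decode", "tokens": 1},
    {"id": "c", "stage": "verify", "tokens": 5},
    {"id": "d", "stage": "decode", "tokens": 1},
]
batch = build_batch(requests, token_budget=256)
print(batch)
```

Because every stage is expressed as a token count, the resulting batch has a fixed total size regardless of how the stages are mixed, which is the property that makes fixed-tile kernels applicable downstream.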

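Key innovation: The paper's main contribution is a pair of optimized meta-kernels for attention and GEMM. For attention, it designs an architecture-aware matmul-softmax-matmul meta-kernel that exploits the hardware's parallel compute capability. For GEMM, it introduces a virtual padding scheme that adapts to dynamic shape changes while using efficient GEMM primitives with fixed tile sizes.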
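The matmul-softmax-matmul pattern can be applied tile by tile if each step carries a running row maximum and row sum. The pure-Python sketch below uses the well-known online-softmax accumulation for that purpose; this is an assumption on our part, since the summary does not say how the XY-Serve meta-kernel combines tiles, and the tile sizes `tq` and `tk` are arbitrary illustrative values rather than Ascend-specific ones.

```python
import math
import random

def attention_tile(q_tile, k_tile, v_tile, state):
    """One matmul-softmax-matmul step on a (query tile, key/value tile) pair.

    state holds, per query row, (running max, running sum, unnormalized output).
    """
    scale = 1.0 / math.sqrt(len(q_tile[0]))
    new_state = []
    for q, (m, l, acc) in zip(q_tile, state):
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]  # matmul 1
        m_new = max(m, max(s))
        p = [math.exp(x - m_new) for x in s]       # softmax numerators
        c = math.exp(m - m_new)                    # rescale earlier partial results
        l_new = l * c + sum(p)
        acc_new = [c * a + sum(pj * v[d] for pj, v in zip(p, v_tile))   # matmul 2
                   for d, a in enumerate(acc)]
        new_state.append((m_new, l_new, acc_new))
    return new_state

def tiled_attention(Q, K, V, tq=4, tk=8):
    """Non-causal attention built only from attention_tile calls."""
    out = []
    for i in range(0, len(Q), tq):
        q_tile = Q[i:i + tq]
        state = [(float("-inf"), 0.0, [0.0] * len(V[0])) for _ in q_tile]
        for j in range(0, len(K), tk):
            state = attention_tile(q_tile, K[j:j + tk], V[j:j + tk], state)
        out.extend([a / l for a in acc] for _, l, acc in state)
    return out

def reference_attention(Q, K, V):
    """Untiled softmax(QK^T / sqrt(d)) V for comparison."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        mx = max(s)
        p = [math.exp(x - mx) for x in s]
        z = sum(p)
        out.append([sum(pj * v[d] for pj, v in zip(p, V)) / z
                    for d in range(len(V[0]))])
    return out

rng = random.Random(0)
def rand(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

Q, K, V = rand(10, 16), rand(21, 16), rand(21, 16)
tiled, ref = tiled_attention(Q, K, V), reference_attention(Q, K, V)
max_err = max(abs(a - b) for ra, rb in zip(tiled, ref) for a, b in zip(ra, rb))
print(max_err)
```

The sequence lengths 10 and 21 are deliberately not multiples of the tile sizes, so the last tile in each dimension is partial. Here slicing handles that; a real fixed-tile kernel would need masking or padding instead.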

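Key design: For the attention meta-kernel, the critical choice is the tile size, which must be tuned to the hardware architecture to maximize parallelism and minimize memory-access overhead. For GEMM virtual padding, the critical choices are the size and position of the padding, which must preserve correctness while keeping the extra computation introduced by padding as small as possible. The hybrid scheduler, in turn, has to account for the resource needs of the Prefill, Decode, and Verify stages to achieve the best overall performance.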
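The trade-off between padding size and extra computation can be illustrated with a small planner that covers a dynamic row count using a menu of fixed tile sizes, then treats rows past the end as zeros without copying the input into a padded buffer. This is a sketch under our own assumptions: the greedy largest-first policy, the `plan_tiles` and `gemm_virtual_padding` names, and the tile menus are hypothetical, and the summary does not specify how XY-Serve places its padding.

```python
def plan_tiles(m, tile_sizes=(128, 64, 32, 16)):
    """Cover m rows with fixed-size tiles, as (start, size) pairs.

    Greedy policy: use the largest tile that fits entirely; if none fits,
    use the smallest tile and let it overhang (the overhang is the padding).
    """
    sizes = sorted(tile_sizes, reverse=True)
    plan, start = [], 0
    while start < m:
        remaining = m - start
        size = next((s for s in sizes if s <= remaining), sizes[-1])
        plan.append((start, size))
        start += size
    return plan

def gemm_virtual_padding(a, b, tile_sizes=(4, 2)):
    """C = A @ B where every tile runs a fixed number of row iterations."""
    m = len(a)
    zero_row = [0.0] * len(b)
    c = [None] * m
    for start, size in plan_tiles(m, tile_sizes):
        for r in range(start, start + size):      # fixed trip count per tile
            row = a[r] if r < m else zero_row     # padded rows read as zeros
            out = [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            if r < m:                             # masked write-back
                c[r] = out
    return c

plan = plan_tiles(300)
padded_rows = sum(size for _, size in plan) - 300
print(plan, padded_rows)

a = [[i + j for j in range(3)] for i in range(7)]  # 7 rows: not a tile multiple
b = [[1, 2], [3, 4], [5, 6]]
c = gemm_virtual_padding(a, b)
expected = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

For 300 rows this plan computes 4 padded rows, whereas covering the same rows with 128-row tiles alone would compute 84.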

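📊 Experimental highlights

XY-Serve delivers a substantial performance gain on Ascend NPUs. End-to-end throughput improves by up to 89% over current publicly available baselines. In addition, its GEMM kernels are on average 14.6% faster than existing libraries, and its attention kernels are on average 21.5% faster. These results indicate that XY-Serve handles the dynamicity of production LLM serving effectively.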

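🎯 Application scenarios

The work applies to LLM serving scenarios that need low latency and high throughput, such as online question answering, text generation, and machine translation. More efficient LLM serving lowers operating cost and gives users a better experience. The approach may also extend to other kinds of deep learning models and serving systems.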

📄 Abstract (original)

Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLM, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta primitives. For attention, we propose a meta-kernel that computes the basic pattern of matmul-softmax-matmul with architectural-aware tile sizes. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes. XY-Serve sits harmoniously with vLLM. Experimental results show up to 89% end-to-end throughput improvement compared with current publicly available baselines on Ascend NPUs. Additionally, our approach outperforms existing GEMM (average 14.6% faster) and attention (average 21.5% faster) kernels relative to existing libraries. While the work is Ascend native, we believe the approach can be readily applicable to SIMT architectures as well.
