Large Language Model Partitioning for Low-Latency Inference at the Edge

Authors: Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

Categories: cs.DC, cs.AI

Published: 2025-05-05

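💡 One-Sentence Summary

Proposes a resource-aware algorithm that partitions LLM Transformers at the attention-head level to reduce inference latency on edge devices.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, edge computing, model partitioning, low-latency inference, Transformer, attention mechanism, resource awareness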

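📋 Key Points

Existing layer-based LLM partitioning methods often cause memory overload or high inference latency in resource-constrained edge environments.
The paper proposes a resource-aware Transformer partitioning algorithm that partitions the model at the attention-head level and migrates heads dynamically to reduce inference latency.
Experiments show that the method outperforms existing layer-based partitioning in inference speed and memory usage, and that its latency in small-scale settings is within 15 to 20 percent of an exact optimal solver.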

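📝 Abstract (Translated from Chinese)

This paper proposes a resource-aware Transformer architecture partitioning algorithm for reducing the inference latency of large language models (LLMs) in edge environments. The algorithm updates its partitioning decision at regular intervals during token generation, based on instantaneous information about device resource availability and network bandwidth. On its first execution it places model blocks on devices; on later executions it migrates these blocks so that the sum of migration delay and inference delay stays low. The decoder is partitioned at the attention-head level, each attention head is co-located with its key-value cache, and heads can be migrated dynamically when resources become tight. Assigning different attention heads to different devices lets the heads execute in parallel, which substantially reduces inference latency. In small-scale settings (3-5 devices), the method achieves latency within 15 to 20 percent of an exact optimal solver; in larger-scale tests, it achieves notable improvements in inference speed and memory usage over state-of-the-art layer-based partitioning approaches.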

🔬 Method Details

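Problem definition: Low-latency LLM inference on edge devices is challenging. In an autoregressive, decoder-only Transformer, the key-value cache grows with every generated token, so memory and compute load increase steadily during generation. In resource-constrained edge environments, conventional layer-based model partitioning therefore often leads to memory overload or high inference latency.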
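To make the cache growth concrete, the following minimal sketch computes the memory held by one attention head's key-value cache in one layer. It relies only on the standard fact that each cached token stores one key vector and one value vector of length `head_dim`; the function name and the example configuration (`head_dim=128`, 2-byte fp16 values) are assumptions chosen for illustration and do not come from the paper.

```python
def kv_cache_bytes_per_head(num_tokens: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes held by one attention head's KV cache in a single layer.

    Each cached token contributes one key vector and one value vector,
    each with head_dim entries of bytes_per_value bytes.
    """
    return 2 * num_tokens * head_dim * bytes_per_value


# Hypothetical configuration: head_dim=128 with fp16 values (2 bytes each).
for tokens in (128, 1024, 8192):
    size = kv_cache_bytes_per_head(tokens, head_dim=128)
    print(f"{tokens:>5} tokens: {size // 1024} KiB per head per layer")
```

The cache grows linearly in the number of tokens, so a placement that fits in memory early in generation may stop fitting later.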

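Core idea: The paper partitions the Transformer decoder at the attention-head level and co-locates each attention head with its key-value cache. Heads are assigned to, and dynamically migrated between, different devices so that they can execute in parallel, which lowers inference latency. Placement and migration decisions use instantaneous information about device resource availability and network bandwidth, and aim to keep the sum of migration delay and inference delay low.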
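Head-level placement can be pictured with a small sketch. The code below is not the paper's algorithm: it is a simple greedy heuristic, written under the assumption that every head (weights plus its key-value cache) needs the same amount of memory, and it places each head on the device that currently has the most free memory. The names `Device`, `place_heads`, and `head_bytes` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str
    free_memory: int  # bytes still available (illustrative unit)
    heads: list = field(default_factory=list)


def place_heads(head_ids, head_bytes, devices):
    """Greedy illustration: put each head, together with its KV cache,
    on the device that currently has the most free memory."""
    for head in head_ids:
        target = max(devices, key=lambda d: d.free_memory)
        if target.free_memory < head_bytes:
            raise MemoryError(f"no device can hold head {head}")
        target.heads.append(head)  # the head and its cache stay together
        target.free_memory -= head_bytes
    return {d.name: d.heads for d in devices}


devices = [Device("a", 300), Device("b", 200)]
placement = place_heads(range(4), head_bytes=100, devices=devices)
print(placement)
```

Because a head and its cache are treated as one unit, moving a head later never separates it from the cached keys and values it needs.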

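Technical framework: The method consists of the following main steps: 1) in the initial phase, attention-head blocks are placed on devices according to the devices' available resources; 2) during token generation, device resource utilization and network bandwidth are evaluated at regular intervals; 3) based on that evaluation, attention-head blocks are migrated to other devices to reduce inference latency. The overall process is an iterative cycle of resource allocation and migration.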
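The three steps above can be sketched as a generation loop that re-runs the placement decision at a fixed token interval. This is only an outline under stated assumptions: `place` and `step` are hypothetical callbacks standing in for the partitioning decision and for generating one token, and the interval value is arbitrary.

```python
def generate_with_repartition(num_tokens, interval, place, step):
    """Run `step` once per token. Call `place` once before the first token
    (initial placement) and again after every `interval` tokens."""
    assignment = place(None)  # initial placement: there is no previous assignment
    decisions = 1
    for t in range(1, num_tokens + 1):
        step(t, assignment)  # generate token t under the current assignment
        if t % interval == 0 and t < num_tokens:
            assignment = place(assignment)  # re-evaluate; may migrate blocks
            decisions += 1
    return assignment, decisions


# Toy usage: the "assignment" is just a version counter here.
log = []
final, decisions = generate_with_repartition(
    num_tokens=10,
    interval=4,
    place=lambda prev: (prev or 0) + 1,
    step=lambda t, a: log.append((t, a)),
)
print(final, decisions)
```

In a real system `place` would read current memory, compute, and bandwidth measurements; here it only increments a counter so the control flow is easy to follow.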

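Key innovation: The main novelty is partitioning at the attention-head level combined with a dynamic migration strategy. Compared with conventional layer-based partitioning, this uses device resources at a finer granularity and lets attention heads execute in parallel, which substantially reduces inference latency. Dynamic migration adapts the partition as resources change, further improving resource utilization.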

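Key design: The algorithm is myopic, meaning that each decision is based only on instantaneous information. Migration decisions aim to minimize the sum of migration delay and inference delay. The specific migration policy and resource evaluation method (for example, CPU utilization, memory footprint, network bandwidth) are not detailed in the abstract and are therefore unknown here.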
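Since the abstract does not give the cost model, the sketch below uses a deliberately simple, assumed one to show what minimizing "migration delay plus inference delay" at a single decision point can look like. Assumptions, none of which come from the paper: every head costs the same `work`, devices run their heads in parallel so the slowest device sets the inference delay, and each moved head costs `head_bytes / bandwidth` on a single shared link. The search is exhaustive, so it is only suitable for tiny instances.

```python
from itertools import product


def total_delay(new, old, speeds, work, head_bytes, bandwidth):
    """Assumed cost model: inference delay plus migration delay."""
    # Inference: devices work in parallel, so the most loaded device dominates.
    load = [0] * len(speeds)
    for dev in new:
        load[dev] += 1
    inference = max(n * work / s for n, s in zip(load, speeds))
    # Migration: every head whose device changed is sent over the link.
    moved = sum(1 for a, b in zip(old, new) if a != b)
    migration = moved * head_bytes / bandwidth
    return inference + migration


def best_assignment(old, speeds, work, head_bytes, bandwidth):
    """Exhaustively pick the assignment with the lowest total delay,
    using only the current (instantaneous) speeds and bandwidth."""
    candidates = product(range(len(speeds)), repeat=len(old))
    return min(
        candidates,
        key=lambda new: total_delay(new, old, speeds, work, head_bytes, bandwidth),
    )


old = (0, 0, 0, 0)  # all four heads start on device 0
fast_link = best_assignment(old, speeds=[1.0, 1.0], work=1.0, head_bytes=1.0, bandwidth=10.0)
slow_link = best_assignment(old, speeds=[1.0, 1.0], work=1.0, head_bytes=1.0, bandwidth=0.1)
print(fast_link, slow_link)
```

With a fast link it pays to move half of the heads to the idle device; with a slow link the migration cost outweighs the parallelism gain and the heads stay where they are.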

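📊 Experimental Highlights

In small-scale settings (3-5 devices), the method achieves latency within 15 to 20 percent of an exact optimal solver. In larger-scale tests, it achieves notable improvements in inference speed and memory usage over state-of-the-art layer-based partitioning approaches. Specific performance figures are not given in the abstract and are therefore unknown here.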

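🎯 Application Scenarios

The results apply to scenarios that require low-latency LLM inference on edge devices, such as smart assistants, real-time translation, and conversational bots. Optimizing model partitioning and resource allocation can improve inference speed and efficiency on edge devices, which improves user experience and lowers deployment cost. In the future, the method could be extended to more complex model architectures and a wider range of edge computing environments.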

📄 Abstract (Original)

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence, the length grows and so does the memory and compute load, due to the expanding key-value caches, which store intermediate representations of all previously generated tokens in the multi-head attention (MHA) layer. As this iterative process steadily increases memory and compute demands, layer-based partitioning in resource-constrained edge environments often results in memory overload or high inference latency. To address this and reduce inference latency, we propose a resource-aware Transformer architecture partitioning algorithm, where the partitioning decision is updated at regular intervals during token generation. The approach is myopic in that it is based on instantaneous information about device resource availability and network link bandwidths. When first executed, the algorithm places blocks on devices, and in later executions, it migrates these blocks among devices so that the sum of migration delay and inference delay remains low. Our approach partitions the decoder at the attention head level, co-locating each attention head with its key-value cache and allowing dynamic migrations whenever resources become tight. By allocating different attention heads to different devices, we exploit parallel execution of attention heads and thus achieve substantial reductions in inference delays. Our experiments show that in small-scale settings (3-5 devices), the proposed method achieves within 15 to 20 percent of an exact optimal solver's latency, while in larger-scale tests it achieves notable improvements in inference speed and memory usage compared to state-of-the-art layer-based partitioning approaches.
