L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference

Authors: Qingyuan Liu, Liyan Chen, Yanning Yang, Haocheng Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen

Categories: cs.AR, cs.LG

Published: 2025-04-24

Comments: 16 pages, 11 figures

💡 One-Sentence Takeaway

Proposes the L3 architecture to address the memory bottleneck in long-context LLM inference.

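🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, long-context inference, DIMM-PIM, GPU acceleration, memory optimization, hardware co-design, multi-head attention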

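📋 Key Points

Existing approaches to long text sequences must trade GPU memory capacity against bandwidth, which limits LLM inference.
L3 addresses the coordination between DIMM-PIM and GPU through hardware redesign, communication optimization, and an adaptive scheduler, improving inference efficiency.
Evaluations using real-world traces show that L3 achieves up to 6.1× speedup over state-of-the-art HBM-PIM solutions while significantly increasing batch size.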

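📝 Abstract (Translated from Chinese)

Large language models (LLMs) run into GPU memory limits when processing long text sequences, forcing a difficult trade-off between memory capacity and bandwidth. Acceleration based on high-bandwidth memory (HBM) provides high bandwidth, but its capacity remains limited. Offloading data to host-side DIMMs increases capacity but introduces costly data-swapping overhead. We identify the decoding phase of multi-head attention (MHA) as the critical memory bottleneck: it needs substantial capacity to store KV caches and high bandwidth for attention computation. Our key insight is that this operation aligns closely with modern DIMM-PIM architectures, which scale in both capacity and bandwidth. Building on this, we propose L3, a hardware-software co-designed system that integrates DIMM-PIM and GPU devices.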
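To make the capacity demand of the KV cache concrete, here is a minimal back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, FP16) and the workload (32K-token context, batch of 8) are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to hold the K and V tensors for every layer and token.

    All arguments are hypothetical inputs; bytes_per_elem=2 assumes FP16.
    """
    # The factor 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed configuration, chosen only for illustration.
total = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                       seq_len=32 * 1024, batch_size=8)
print(f"KV cache: {total / 2**30:.0f} GiB")
```

Under these assumptions the cache comes to 128 GiB, which is more than a single GPU's HBM typically holds and illustrates why capacity limits batch size for long contexts.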

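🔬 Method Details

Problem definition: This paper targets the memory bottleneck in long-context LLM inference. Existing approaches must trade GPU memory capacity against bandwidth, which limits performance.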
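The bandwidth side of this trade-off can be seen from a rough arithmetic-intensity estimate for one decode step of a single attention head. This is a simplified sketch: it counts only the two matrix-vector products and the K/V reads, ignores softmax and the query/output traffic, and assumes FP16 storage.

```python
def decode_attention_intensity(seq_len, head_dim, bytes_per_elem=2):
    """FLOPs per byte of KV cache read for one decode step of one head.

    Simplified model: softmax and query/output traffic are ignored.
    """
    # q . K^T: one multiply-add (2 FLOPs) per cached key element.
    flops = 2 * seq_len * head_dim
    # probs . V: one multiply-add per cached value element.
    flops += 2 * seq_len * head_dim
    # Both K and V must be streamed from memory once.
    bytes_read = 2 * seq_len * head_dim * bytes_per_elem
    return flops / bytes_read


for length in (4096, 32768):
    print(length, decode_attention_intensity(length, head_dim=128))
```

In this model the intensity is 1 FLOP per byte regardless of context length, so the step time is set by how fast the KV cache can be streamed rather than by compute throughput.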

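Core idea: We propose the L3 architecture, which exploits the scalability of DIMM-PIM to serve the capacity and bandwidth demands of the multi-head attention decoding phase and thereby improve inference efficiency.

Technical framework: L3 consists of three main components: hardware redesign, communication optimization, and an adaptive scheduler. The hardware redesign resolves data layout mismatches and computational element mismatches in DIMM-PIM, while the communication optimization improves efficiency by hiding data transfer overhead behind computation.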
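The benefit of hiding transfers behind computation can be illustrated with a simple two-stage pipeline timing model. This is a generic sketch, not L3's actual mechanism: the chunking scheme and the per-chunk times are hypothetical.

```python
def serial_time(chunks):
    """Total time when each chunk is transferred and then computed, in turn."""
    return sum(transfer + compute for transfer, compute in chunks)


def pipelined_time(chunks):
    """Total time when the transfer of chunk i+1 overlaps compute of chunk i.

    Assumes one transfer channel and one compute unit, each handling
    chunks in order (a two-stage pipeline).
    """
    transfer_done = 0.0
    compute_done = 0.0
    for transfer, compute in chunks:
        transfer_done += transfer
        # Compute starts once the data has arrived and the unit is free.
        compute_done = max(compute_done, transfer_done) + compute
    return compute_done


# Hypothetical workload: 4 chunks, 1 time unit to transfer, 2 to compute.
chunks = [(1.0, 2.0)] * 4
print(serial_time(chunks), pipelined_time(chunks))
```

With these made-up numbers the serial schedule takes 12 time units and the pipelined one takes 9: only the first transfer is exposed, and the rest is hidden behind computation.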

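Key innovation: L3's novelty lies in hardware-software co-design, in particular its optimizations for DIMM-PIM, which make the coordination between GPU and DIMM-PIM operations more efficient and maximize parallelism between devices.

Key design: The design focuses on optimizing data layout, matching computational elements, and making the scheduling algorithm adaptive, with the aim of sustaining good performance under varying loads.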
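The summary does not describe the scheduling algorithm itself. As a loosely related illustration of load-aware placement, the sketch below uses a standard greedy heuristic (longest-processing-time first) to spread requests across PIM devices. It is a stand-in, not L3's scheduler; the function name, the device count, and the assumption that attention cost scales with KV cache length are all hypothetical.

```python
import heapq


def assign_requests(kv_lengths, num_devices):
    """Greedy LPT placement: longest request goes to the least-loaded device.

    kv_lengths[i] is the KV cache length of request i, used here as an
    assumed proxy for its attention cost. Returns (assignment, loads).
    """
    heap = [(0, device) for device in range(num_devices)]
    heapq.heapify(heap)
    assignment = {device: [] for device in range(num_devices)}
    # Visit requests from longest to shortest.
    order = sorted(enumerate(kv_lengths), key=lambda item: -item[1])
    for request_id, length in order:
        load, device = heapq.heappop(heap)
        assignment[device].append(request_id)
        heapq.heappush(heap, (load + length, device))
    loads = [sum(kv_lengths[r] for r in assignment[d])
             for d in range(num_devices)]
    return assignment, loads


# Hypothetical batch of four requests spread over two devices.
assignment, loads = assign_requests([8000, 1000, 4000, 3000], num_devices=2)
print(assignment, loads)
```

A real scheduler would also have to weigh GPU-side work and transfer costs; this sketch only balances one resource.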

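📊 Experimental Highlights

Evaluations using real-world traces show that L3 achieves up to 6.1× speedup over state-of-the-art HBM-PIM solutions while significantly increasing batch sizes, demonstrating its advantage for long-context inference.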

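🎯 Application Scenarios

L3 has broad potential for inference over long text sequences, particularly in natural language processing, machine translation, and text generation. Its efficient memory management and compute capability could help move large language models into practical deployment and improve user experience and system performance.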

📄 Abstract (Original)

Large Language Models (LLMs) increasingly require processing long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. While HBM-based acceleration offers high bandwidth, its capacity remains constrained. Offloading data to host-side DIMMs improves capacity but introduces costly data swapping overhead. We identify that the critical memory bottleneck lies in the decoding phase of multi-head attention (MHA) exclusively, which demands substantial capacity for storing KV caches and high bandwidth for attention computation. Our key insight reveals this operation uniquely aligns with modern DIMM-based processing-in-memory (PIM) architectures, which offers scalability of both capacity and bandwidth. Based on this observation and insight, we propose L3, a hardware-software co-designed system integrating DIMM-PIM and GPU devices. L3 introduces three innovations: First, hardware redesigns resolve data layout mismatches and computational element mismatches in DIMM-PIM, enhancing LLM inference utilization. Second, communication optimization enables hiding the data transfer overhead with the computation. Third, an adaptive scheduler coordinates GPU-DIMM-PIM operations to maximize parallelism between devices. Evaluations using real-world traces show L3 achieves up to 6.1$\times$ speedup over state-of-the-art HBM-PIM solutions while significantly improving batch sizes.
