SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
Authors: Zihao Wang, Bin Cui, Shaoduo Gan
Categories: cs.LG, cs.CL
Published: 2024-04-07 (updated: 2024-10-10)
🔗 Code/Project: https://github.com/hetailang/SqueezeAttention
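💡 One-Sentence Summary
Proposes SqueezeAttention to optimize KV-cache management in LLM inference
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: large language models, KV-cache optimization, inference efficiency, dynamic budget allocation, self-attention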
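📋 Key Points
- Existing KV-cache compression algorithms assign the same budget to every layer and ignore differences in layer importance, which wastes cache on layers that need less of it.
- This paper proposes SqueezeAttention, which estimates the importance of each layer and adjusts the per-layer KV-cache budget on the fly for more efficient inference.
- Experiments across a range of LLMs and benchmarks show memory reductions of around 30% to 70% and throughput improvements of up to 2.2x.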
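📝 Abstract (summary, translated from Chinese)
Optimizing the key-value (KV) cache of large language models (LLMs) is considered critical to reducing inference cost. Most existing KV-cache compression algorithms sparsify the token sequence by exploiting differences in token importance, but they typically allocate the same KV budget to every layer, which is suboptimal. This paper proposes SqueezeAttention, which identifies the importance of attention layers and optimizes the KV-cache jointly along two dimensions: the sequence and the layers. Each layer's importance is measured by the cosine similarity of the input prompt differences before and after the self-attention layer, and the layer's KV budget is adjusted accordingly. With this optimization, SqueezeAttention achieves memory reductions of around 30% to 70% and throughput improvements of up to 2.2x across a range of LLMs and benchmarks.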
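🔬 Method Details
Problem definition: Existing KV-cache compression algorithms do not allocate budget according to layer importance, which lowers inference efficiency and wastes resources.
Core idea: Measure each layer's importance by the cosine similarity of the input prompt differences before and after the self-attention layer, then use that measurement to decide how much KV-cache each layer receives.
Technical framework: The pipeline has three steps. First, estimate the importance of each layer. Second, categorize the layers into two groups. Third, adjust the KV budget of each group and apply one of three representative sequence-wise compression algorithms to every layer within its own budget.
Key innovation: KV-cache management is optimized at the layer level as well as the sequence level. Budgets follow layer importance and are adjusted on the fly instead of being fixed and uniform.
Key design: Cosine similarity serves as the layer-importance metric, the per-layer KV budget is adjusted from that metric, and the sequence-wise compression algorithm then operates within each layer's own budget.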
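The steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the function names `layer_similarity` and `allocate_budgets`, the median threshold used to form the two groups, the `shrink` factor, and the rule that a layer with high before/after similarity is treated as less important are all assumptions made for this example. The source only states that cosine similarity is used, that layers are split into two groups, and that budgets are adjusted accordingly.

```python
import numpy as np


def layer_similarity(before, after):
    """Mean per-token cosine similarity between the hidden states entering
    and leaving one self-attention layer (arrays of shape [tokens, dim])."""
    dot = np.sum(before * after, axis=-1)
    norms = np.linalg.norm(before, axis=-1) * np.linalg.norm(after, axis=-1)
    return float(np.mean(dot / (norms + 1e-12)))


def allocate_budgets(similarities, base_budget, shrink=0.5):
    """Split layers into two groups and move KV budget between them.

    Assumption: a layer whose output is very similar to its input changes
    the prompt representation little, so it is treated as less important.
    The total budget over all layers is preserved (up to rounding down).
    """
    sims = np.asarray(similarities, dtype=float)
    low_importance = sims > np.median(sims)  # hypothetical grouping rule
    n_low = int(low_importance.sum())
    n_high = sims.size - n_low
    budgets = np.full(sims.size, float(base_budget))
    if n_low == 0 or n_high == 0:
        return budgets.astype(int)  # nothing to reallocate
    budgets[low_importance] = base_budget * shrink
    freed = base_budget * (1.0 - shrink) * n_low
    budgets[~low_importance] = base_budget + freed / n_high
    return np.floor(budgets).astype(int)


# Four layers: the last two barely change their input, so they give up budget.
print(allocate_budgets([0.2, 0.3, 0.9, 0.95], base_budget=100).tolist())
# -> [150, 150, 50, 50]
```

Each resulting per-layer budget would then be passed to whichever sequence-wise compression algorithm is in use, so that the algorithm keeps at most that many tokens in the layer's cache.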
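📊 Experimental Highlights
Across a range of large language models and benchmarks, SqueezeAttention reduces memory by around 30% to 70% and raises throughput by up to 2.2x, a marked improvement over existing KV-cache management methods.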
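To make the link between per-layer budgets and memory concrete, the following back-of-the-envelope sketch computes KV-cache size from the number of tokens kept in each layer. All model dimensions and budgets here are hypothetical values chosen for easy arithmetic. They are not taken from the paper and do not reproduce its measurements. The helper name `kv_cache_bytes` is also an assumption.

```python
def kv_cache_bytes(layer_budgets, num_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer, given how many
    tokens each layer keeps. bytes_per_value=2 corresponds to fp16."""
    per_token = 2 * num_kv_heads * head_dim * bytes_per_value  # keys + values
    return batch_size * per_token * sum(layer_budgets)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128, fp16.
full = kv_cache_bytes([4096] * 32, num_kv_heads=32, head_dim=128)
# Hypothetical budgets: half the layers keep 3072 tokens, half keep 1024.
squeezed = kv_cache_bytes([3072] * 16 + [1024] * 16,
                          num_kv_heads=32, head_dim=128)

print(full / 2**30, squeezed / 2**30)  # -> 2.0 1.0 (GiB)
```

Because cache size is linear in the sum of per-layer budgets, any memory saving comes from lowering that sum. Shifting budget between layers changes where the retained tokens sit, not the total.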
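🎯 Application Scenarios
The method applies broadly to inference optimization for large language models, where it can reduce memory consumption and increase processing speed. It suits natural language processing, dialogue systems, and other AI applications that need efficient inference. Going forward, it may encourage more efficient model design and deployment.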
📄 Abstract (original)
Optimizing the Key-Value (KV) cache of the Large Language Model (LLM) has been considered critical to saving the cost of inference. Most of the existing KV-cache compression algorithms attempted to sparsify the sequence of tokens by taking advantage of the different importance of tokens. However, most of these methods treat all layers equally, allocating the same KV budget to each layer. This approach is suboptimal, as some layers may be less sensitive to input tokens yet still receive the same budget as others. In this work, we found that by identifying the importance of attention layers, we could optimize the KV-cache jointly from two dimensions, i.e., sequence-wise and layer-wise. Based on our observations regarding layer-wise importance in inference, we propose SqueezeAttention to precisely optimize the allocation of KV-cache budget among layers on-the-fly and then incorporate three representative sequence-wise algorithms to compress the KV-cache for each layer with its very own budget. Specifically, we first measure each layer's importance by calculating the cosine similarity of the input prompt differences before and after the self-attention layers. Based on this similarity, we then categorize the layers into two groups and adjust their KV budgets accordingly. By optimizing the KV-cache from both sequence's and layer's dimensions, SqueezeAttention achieves around 30% to 70% of the memory reductions and up to 2.2 times of throughput improvements in a wide range of LLMs and benchmarks. The code is available at https://github.com/hetailang/SqueezeAttention.