Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Patrick Woods, Gabriel Hillesheim, Abolfazl Razi

Category: cs.CV

Published: 2026-04-07

💡 One-Sentence Summary

Proposes an adaptive KV-cache quantization method that reduces memory use and latency for lightweight on-device LLMs.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: KV-cache quantization, adaptive quantization, on-device LLM, lightweight models, model compression

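📋 Key Points

Existing KV-cache quantization methods rely on fixed precision or hand-crafted heuristics, which wastes resources and costs accuracy.
The paper proposes adaptive KV-cache quantization, which adjusts quantization precision dynamically according to token importance to reduce memory use and latency.
Experiments show that the method outperforms static quantization in accuracy, latency, and memory footprint while staying close to FP16 inference.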

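📝 Abstract (Translated from Chinese)

Large language models (LLMs) have made remarkable progress on reasoning, generation, and decision-making tasks, but deploying them on mobile, embedded, and edge devices remains very challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, so they waste bits on low-impact tokens while over-compressing informative ones, which causes avoidable accuracy loss. Inspired by the variable-length allocation principle of Huffman coding, the authors propose adaptive KV-cache quantization, a learned policy that assigns bit-width in proportion to token importance and thereby minimizes expected memory and latency without sacrificing competitive accuracy. The framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. Compared with static KV quantization and rule-based baselines, this adaptive precision policy reduces KV memory footprint and latency while improving accuracy, and it keeps accuracy competitive with FP16 inference on standard LLM benchmarks. Extensive experiments with SmolLM-135M, SmolLM-360M, and SmolLM-1.7B on multiple commonsense reasoning benchmarks show that the controller consistently improves the accuracy-latency trade-off.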

🔬 Method Details

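Problem definition: The paper targets the large memory and latency overhead that the KV cache imposes when large language models are deployed on resource-constrained devices. Existing approaches such as fixed-precision quantization cannot account for differences in token importance, so they either lose accuracy or waste resources.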
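To make the linear growth of the KV cache concrete, here is a minimal back-of-the-envelope sketch. The function `kv_cache_bytes` and the model shape are hypothetical, chosen only for illustration and not taken from the paper; the estimate also ignores the small overhead of quantization scales and zero-points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV-cache size in bytes for a single sequence."""
    # Each layer caches one key vector and one value vector per token.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8


# Hypothetical model shape (an assumption, not a configuration from the paper).
shape = dict(num_layers=30, num_kv_heads=3, head_dim=64)

for seq_len in (1024, 2048, 4096):
    fp16 = kv_cache_bytes(seq_len=seq_len, bits_per_value=16, **shape)
    int4 = kv_cache_bytes(seq_len=seq_len, bits_per_value=4, **shape)
    print(f"{seq_len:5d} tokens: FP16 {fp16 / 2**20:6.1f} MiB, "
          f"4-bit {int4 / 2**20:6.1f} MiB")
```

Doubling the context length doubles the cache, and dropping from 16 to 4 bits per value cuts it by a factor of four, which is why per-token precision is a natural lever on memory-limited devices.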

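Core idea: Borrowing from Huffman coding, the method assigns quantization bit-widths adaptively according to token importance. Important tokens receive higher precision and unimportant tokens receive lower precision, which lowers overall memory footprint and latency while preserving accuracy.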
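The Huffman analogy can be illustrated with a small calculation of the average bits per cached value under a mixed-precision allocation. This is only a sketch: `expected_bits` is a hypothetical helper, and the token shares below are invented for illustration rather than measured in the paper.

```python
def expected_bits(shares):
    """Average bits per cached value, given {bit_width: fraction_of_tokens}."""
    total = sum(shares.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    return sum(bits * frac for bits, frac in shares.items())


# Hypothetical allocation: most tokens are low-impact and get few bits,
# while a small share of informative tokens stays at FP16 (16 bits).
shares = {2: 0.5, 4: 0.25, 8: 0.125, 16: 0.125}

avg = expected_bits(shares)
print(avg, avg / 16)  # average bits, and size relative to an all-FP16 cache
```

With these made-up shares the average is 5 bits per value, about 31% of an all-FP16 cache, even though one token in eight keeps full precision.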

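Technical framework: The framework has two main parts, a feature extraction module and an adaptive quantization controller. The feature extraction module computes token-level features: token frequency, quality score, attention variance, and entropy-based uncertainty. The adaptive quantization controller is a lightweight data-driven model that uses these features to select the quantization precision of each token dynamically (2-bit, 4-bit, 8-bit, or FP16). During decoding, each token's KV values are quantized at the precision the controller selects, which yields adaptive KV-cache quantization.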
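The two-stage pipeline described above can be sketched in plain Python. This is a stand-in under stated assumptions, not the paper's implementation: the paper's controller is learned, whereas the weights, thresholds, feature values, and all function names here are hypothetical and hand-set, and the quantizer is a generic symmetric uniform scheme.

```python
import math

BIT_CHOICES = (2, 4, 8, 16)  # 16 stands for FP16, stored unquantized here


def attention_entropy(probs):
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def importance_score(features, weights):
    """Weighted sum of token features, squashed to (0, 1) by a sigmoid."""
    z = sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def select_bits(score, thresholds=(0.25, 0.5, 0.75)):
    """Map an importance score to a bit-width; higher score means more bits."""
    for bits, limit in zip(BIT_CHOICES, thresholds):
        if score < limit:
            return bits
    return BIT_CHOICES[-1]


def fake_quantize(vector, bits):
    """Symmetric uniform quantize-then-dequantize of one KV vector."""
    if bits >= 16:
        return list(vector)  # FP16 path: left untouched in this sketch
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    max_abs = max(abs(x) for x in vector)
    if max_abs == 0:
        return list(vector)
    scale = max_abs / levels
    return [round(x / scale) * scale for x in vector]


# Hand-set weights and feature values (assumptions for illustration only).
weights = {"frequency": -1.0, "quality": 1.5, "attn_variance": 2.0, "entropy": 1.0}
features = {
    "frequency": 0.9,
    "quality": 0.2,
    "attn_variance": 0.1,
    "entropy": attention_entropy([0.7, 0.2, 0.1]),
}

bits = select_bits(importance_score(features, weights))
key_vector = [0.12, -0.53, 0.98, -0.07]
print(bits, fake_quantize(key_vector, bits))
```

A threshold rule over a single score is the simplest possible policy; a learned controller would replace `importance_score` and `select_bits` with a small trained model that outputs one of the four precisions directly.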

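Key innovation: The paper's main contribution is an adaptive KV-cache quantization policy that adjusts quantization precision dynamically according to token importance. Compared with conventional fixed-precision quantization, it uses limited resources more effectively and substantially reduces memory footprint and latency while preserving accuracy.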

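Key design: The adaptive quantization controller is the central design element. The paper uses a lightweight neural network as the controller; its input is the token-level features and its output is the quantization precision. The controller is trained to minimize memory footprint and latency while maintaining high accuracy. Implementation details include the choice of features, the network architecture, and the design of the loss function.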
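This summary does not give the exact loss function. One plausible form, written here purely as an assumption to make the trade-off explicit, adds memory and latency penalties to a task loss. The symbols $\lambda_{\text{mem}}$, $\lambda_{\text{lat}}$, $b_\theta$, and $\ell$ are hypothetical and do not come from the paper.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\text{task}}(\theta)
  + \lambda_{\text{mem}} \, \mathbb{E}_{t}\!\left[ b_\theta(x_t) \right]
  + \lambda_{\text{lat}} \, \mathbb{E}_{t}\!\left[ \ell\!\left( b_\theta(x_t) \right) \right]
```

Here $x_t$ is the feature vector of token $t$, $b_\theta(x_t) \in \{2, 4, 8, 16\}$ is the bit-width the controller with parameters $\theta$ selects, $\ell(\cdot)$ is the per-token decoding latency at that bit-width, and the two $\lambda$ coefficients set how strongly memory and latency are traded against task accuracy.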

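📊 Experimental Highlights

With SmolLM-360M on HellaSwag, the method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and stays within 0.30 points of FP16 inference. Across multiple commonsense reasoning benchmarks, it consistently improves the accuracy-latency trade-off.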

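🎯 Application Scenarios

The approach applies broadly to resource-constrained settings such as mobile devices, embedded systems, and edge devices, where it can ease the deployment of large language models. For example, it could speed up local LLM inference on smartphones to improve responsiveness and user experience, or support intelligent assistants on IoT devices with more efficient natural-language interaction.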

📄 Abstract (Original)

Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging. On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost. Existing KV-cache quantization schemes typically rely on fixed precision or hand-crafted heuristics, thereby wasting bits on low-impact tokens while over-compressing informative ones, leading to avoidable accuracy degradation. Inspired by Huffman coding's principle of variable-length allocation, we propose adaptive KV-cache quantization, a learned policy that assigns bit-width proportional to token importance, minimizing expected memory and latency without sacrificing competitive accuracy. Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding. This adaptive precision policy reduces KV memory footprint and latency while improving accuracy compared to static KV quantization and rule-based baselines, and maintaining competitive accuracy close to FP16 inference across standard LLM benchmarks. Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off. For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
