KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Authors: Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli

Category: cs.LG

Published: 2026-06-02

🔗 Code/Project: GitHub (https://github.com/huawei-csl/KVarN)

💡 One-Sentence Takeaway

Proposes KVarN to address error accumulation in KV-cache quantization

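🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: KV cache, quantization, autoregressive decoding, reasoning tasks, Hadamard rotation, variance normalization, large language models, error accumulation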

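📋 Key Points

Existing KV-cache quantization methods perform poorly under autoregressive decoding: quantization errors accumulate and degrade reasoning quality.
KVarN is a new calibration-free quantizer that combines a Hadamard rotation with dual-scaling variance normalization to reduce these errors.
On several generative benchmarks, KVarN substantially reduces error accumulation at 2-bit precision and outperforms existing baselines.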

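📝 Abstract (Summary)

Test-time scaling is a powerful way to obtain better reasoning from large language models, but during long-horizon decoding the growing KV cache becomes a memory bottleneck. KV-cache quantization can help, but existing methods are evaluated under prefill-like settings, and errors behave differently under autoregressive decoding. KVarN is a calibration-free KV-cache quantizer that addresses error accumulation by applying a Hadamard rotation followed by dual-scaling variance normalization across both axes of the K and V matrices. Experiments show that KVarN sets a new state of the art for KV-cache quantization at 2-bit precision on generative benchmarks including MATH500, AIME24, and HumanEval.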
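To make the memory bottleneck concrete, the following back-of-the-envelope sketch estimates KV-cache size at 16-bit and 2-bit precision. The model shape (`num_layers`, `num_kv_heads`, `head_dim`) is hypothetical and chosen only for illustration, and the estimate ignores per-group scales and other quantization metadata.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bits):
    """Bytes needed to store K and V for one sequence (ignores scales/metadata)."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = K and V
    return elements * bits // 8


# Hypothetical model shape, not taken from the paper.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768):
    fp16 = kv_cache_bytes(seq_len, bits=16, **cfg)
    int2 = kv_cache_bytes(seq_len, bits=2, **cfg)
    print(f"{seq_len:>6} tokens: 16-bit {fp16 / 2**20:8.1f} MiB, 2-bit {int2 / 2**20:8.1f} MiB")
```

The cache grows linearly with the number of decoded tokens, so the 8x reduction from 16 bits to 2 bits matters most for long reasoning traces.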

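🔬 Method Details

Problem definition: Under autoregressive decoding, the errors of current KV-cache quantization methods accumulate across timesteps, driven primarily by incorrect token scales. This degrades reasoning quality, especially during long-horizon decoding.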
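One simple way to look at token-scale error is to compare the norm of each token vector before and after quantization. The sketch below does this with a plain per-token min-max 2-bit quantizer on synthetic data. The quantizer, the data, and the norm-ratio metric are all assumptions made for illustration; they are not the baselines or the measurement used in the paper.

```python
import numpy as np


def quantize_dequantize(x, bits=2):
    """Per-token (row-wise) min-max uniform quantization, then dequantization."""
    levels = 2**bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    step = np.maximum(hi - lo, 1e-12) / levels
    codes = np.clip(np.round((x - lo) / step), 0, levels)
    return codes * step + lo


rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 64))  # toy "K" matrix: 16 tokens x 64 channels
keys[:, 3] *= 20.0                    # one synthetic outlier channel

keys_hat = quantize_dequantize(keys, bits=2)

# Ratio of reconstructed to original token norm; 1.0 means the scale is preserved.
scale_ratio = np.linalg.norm(keys_hat, axis=1) / np.linalg.norm(keys, axis=1)
print("per-token scale ratio: min %.3f, max %.3f" % (scale_ratio.min(), scale_ratio.max()))
```

A ratio far from 1.0 means the token enters attention with the wrong magnitude, which is the kind of per-token distortion the problem statement points to.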

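Core idea: KVarN corrects token-scale errors with a Hadamard rotation followed by dual-scaling variance normalization, which reduces error accumulation. The design aims to improve quantization accuracy and limit the error propagation seen in existing methods.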
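The effect of a Hadamard rotation on outliers can be seen in a few lines. The sketch below is a generic orthonormal fast Walsh-Hadamard transform in pure Python, not code from the KVarN repository; the function name `hadamard_rotate` and the toy vector are illustrative.

```python
import math


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


# A vector with a single large outlier channel.
x = [0.0] * 8
x[2] = 8.0

y = hadamard_rotate(x)
print(max(abs(v) for v in x), max(abs(v) for v in y))  # outlier magnitude before / after
print(hadamard_rotate(y))  # the normalized transform is its own inverse
```

The rotation preserves the vector norm but spreads the outlier's energy evenly over all channels, which narrows the value range a low-bit quantizer has to cover.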

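Technical framework: KVarN has two main stages. A Hadamard rotation first reshapes the data distribution; dual-scaling variance normalization is then applied across both axes of the K and V matrices so that quantization error is controlled along each axis.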
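The two stages can be sketched end to end as follows. This is one plausible reading of "dual-scaling variance normalization across both axes", assuming alternating per-token and per-channel standard-deviation scaling; the iteration count, the whole-matrix min-max quantizer, and all function names are assumptions for illustration and may differ from the actual KVarN implementation.

```python
import numpy as np


def hadamard_matrix(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def dual_scale_normalize(x, iters=4, eps=1e-8):
    """Alternately rescale rows (tokens) and columns (channels) toward unit std.

    Assumed scheme. Returns xn, row_scale, col_scale such that
    x == row_scale[:, None] * xn * col_scale[None, :].
    """
    xn = np.array(x, dtype=float)
    row_scale = np.ones(xn.shape[0])
    col_scale = np.ones(xn.shape[1])
    for _ in range(iters):
        r = xn.std(axis=1) + eps
        xn /= r[:, None]
        row_scale *= r
        c = xn.std(axis=0) + eps
        xn /= c[None, :]
        col_scale *= c
    return xn, row_scale, col_scale


def quantize_dequantize(x, bits=2):
    """Whole-matrix min-max uniform quantizer (a stand-in, not the paper's quantizer)."""
    levels = 2**bits - 1
    lo, hi = x.min(), x.max()
    step = max(hi - lo, 1e-12) / levels
    return np.clip(np.round((x - lo) / step), 0, levels) * step + lo


def kvarn_like_roundtrip(x, bits=2):
    """Rotate, normalize along both axes, quantize, then undo scaling and rotation."""
    h = hadamard_matrix(x.shape[1])
    xn, row_scale, col_scale = dual_scale_normalize(x @ h)
    xq = quantize_dequantize(xn, bits)
    return (row_scale[:, None] * xq * col_scale[None, :]) @ h.T


rng = np.random.default_rng(0)
k = rng.standard_normal((32, 64))  # toy "K" matrix: 32 tokens x 64 channels
k[:, 5] *= 30.0                    # synthetic outlier channel
k_hat = kvarn_like_roundtrip(k, bits=2)
print("relative error:", np.linalg.norm(k_hat - k) / np.linalg.norm(k))
```

In this sketch the row and column scales are kept in full precision and multiplied back after dequantization, so only the normalized matrix is stored at low precision.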

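Key innovation: KVarN requires no calibration. By combining a Hadamard rotation with variance normalization, it fixes outlying token-scale errors and addresses the error accumulation that existing methods suffer under autoregressive decoding.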

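Key design: KVarN quantizes the K and V matrices at 2-bit precision. This summary also mentions a specific loss function used to optimize the quantization process, but the abstract describes the method as calibration-free and does not mention one.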
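At 2-bit precision each value takes one of four codes, so four codes fit in one byte. The sketch below shows one possible packing layout in pure Python; the layout and the function names `pack_2bit` and `unpack_2bit` are illustrative assumptions and are not taken from the vLLM implementation.

```python
def pack_2bit(codes):
    """Pack integers in [0, 3] into bytes, four codes per byte (lowest bits first)."""
    if any(c < 0 or c > 3 for c in codes):
        raise ValueError("codes must be in the range 0..3")
    packed = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        packed[i // 4] |= c << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed, count):
    """Inverse of pack_2bit; `count` is the number of codes originally packed."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


codes = [0, 3, 1, 2, 2, 1]
blob = pack_2bit(codes)
print(len(blob), unpack_2bit(blob, len(codes)))
```

The caller has to track `count` separately, because the final byte is zero-padded when the number of codes is not a multiple of four.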

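📊 Experimental Highlights

KVarN sets a new state of the art at 2-bit precision on generative benchmarks including MATH500, AIME24, and HumanEval, and substantially reduces error accumulation relative to existing baselines. The summary does not report specific improvement figures.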

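🎯 Application Scenarios

KVarN is broadly applicable to reasoning workloads with large language models, particularly those that require long-horizon decoding, such as dialogue systems, code generation, and mathematical problem solving. By reducing error accumulation, it could improve both the efficiency and the accuracy of model reasoning.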

📄 Abstract (Original)

Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding, as the KV-cache grows. KV-cache quantization can help improve this, but current methods are evaluated under prefill-like settings and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation over existing baselines. KVarN establishes a new state-of-the-art for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN
