R-KV: Redundancy-aware KV Cache Compression for Reasoning Models

Authors: Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, Junjie Hu

Categories: cs.CL, cs.AI

Published: 2025-05-30 (updated: 2025-06-13)

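💡 One-Sentence Summary

Proposes R-KV, a redundancy-aware KV cache compression method that targets redundant tokens in reasoning models.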

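🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: reasoning models, KV cache, compression, redundant tokens, performance optimization, memory management, natural language processing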

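📋 Key Points

Existing KV cache compression methods do not handle redundant tokens well in reasoning models, which degrades performance and can cause reasoning failures.
This paper proposes R-KV, which identifies and compresses redundant tokens in reasoning models so that the KV cache is used more efficiently.
Experiments show that R-KV outperforms existing KV cache compression baselines on two mathematical reasoning datasets while substantially reducing memory use.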

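📝 Abstract (translated from Chinese)

Reasoning models perform well at self-reflection and chain-of-thought reasoning, but they often produce excessively long outputs, which makes the key-value (KV) cache very large at inference time. Existing KV cache compression methods perform poorly on complex reasoning tasks and can cause reasoning failures. To address this, the paper proposes R-KV, a KV cache compression method that targets redundant tokens in reasoning models. Using only 10% of the KV cache, R-KV preserves nearly 100% of full-KV-cache performance, substantially outperforming existing baselines, which reach only 60%. With 16% of the KV cache, R-KV even reaches 105% of full-KV-cache performance. The cache reduction yields a 90% memory saving and a 6.6x throughput improvement. Experiments show that R-KV consistently outperforms existing KV cache compression baselines on two mathematical reasoning datasets.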
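To make the memory argument concrete, the sketch below estimates KV cache size with the standard formula (two tensors, keys and values, per layer per token). The model dimensions and the 32,768-token output length are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size in bytes for a decoder-only transformer.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical configuration (an assumption for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
FULL_TOKENS = 32_768                      # a long chain-of-thought output
BUDGET_TOKENS = int(0.10 * FULL_TOKENS)   # keep roughly 10% of the tokens

full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, FULL_TOKENS)
compressed = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, BUDGET_TOKENS)
saving = 1 - compressed / full

print(f"full cache:       {full / 2**30:.2f} GiB")
print(f"compressed cache: {compressed / 2**30:.2f} GiB")
print(f"memory saving:    {saving:.1%}")
```

Because cache size is linear in the number of retained tokens, keeping about 10% of the tokens corresponds to roughly a 90% reduction in KV cache memory, which matches the scale of saving the abstract reports.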

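🔬 Method Details

Problem definition: The paper addresses the performance bottleneck caused by oversized KV caches in reasoning models. Existing methods handle redundant tokens poorly, which hurts both the accuracy and the efficiency of inference.

Core idea: R-KV identifies redundant tokens in the output of reasoning models and uses that signal to decide what to keep in the KV cache, aiming to cut memory use sharply while preserving performance.

Technical framework: The overall pipeline consists of a redundant-token detection module, a KV cache compression module, and a performance evaluation module. Redundant tokens are detected first, the cache is then compressed, and the resulting performance is evaluated last.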
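The detect-then-compress flow can be sketched as follows. This summary does not give the scoring rule, so the sketch makes several assumptions: redundancy is measured as the mean cosine similarity between a token's key vector and the other cached keys, `importance` is a hypothetical per-token score (for example derived from attention weights), and `lam` is a hypothetical mixing weight. It is an illustration of the idea, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def redundancy_scores(keys):
    """Assumed redundancy signal: mean similarity of each key to all others."""
    scores = []
    for i, key in enumerate(keys):
        sims = [cosine(key, other) for j, other in enumerate(keys) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores


def select_tokens(keys, importance, budget, lam=0.5):
    """Return the (sorted) indices of the `budget` tokens to keep.

    Tokens score higher when they are important and lower when they are
    redundant; `lam` (hypothetical) trades the two off.
    """
    redundancy = redundancy_scores(keys)
    combined = [lam * imp - (1 - lam) * red
                for imp, red in zip(importance, redundancy)]
    ranked = sorted(range(len(keys)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:budget])


# Tokens 0 and 1 have identical keys, so they are redundant with each other.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
importance = [0.5, 0.5, 0.5]
kept = select_tokens(keys, importance, budget=2)
print(kept)
```

With equal importance scores, the distinct token (index 2) is always retained and only one of the two duplicate tokens survives, which is the behavior a redundancy-aware policy is meant to produce.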

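Key innovation: The main contribution is the redundancy-aware mechanism, which allows aggressive KV cache compression while keeping reasoning performance close to that of the full cache, in contrast to conventional compression methods.

Key design: R-KV optimizes its parameter settings, using a specific loss function to balance compression ratio against performance, and adds a mechanism that dynamically identifies redundant tokens to improve compression.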
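One way to picture dynamic identification during decoding is a loop that re-selects tokens whenever the cache outgrows its budget. The sketch below is an assumption-heavy illustration: the `budget` and `buffer` parameters, the compression trigger, and the `select` callback are all hypothetical, and the stand-in policy `keep_recent` simply keeps the newest tokens rather than scoring redundancy.

```python
def decode_with_budget(new_keys, budget, buffer, select):
    """Simulate decoding with a bounded KV cache.

    new_keys: iterable of per-token key entries, in generation order.
    budget:   number of tokens kept after each compression step.
    buffer:   extra tokens allowed to accumulate before compressing.
    select:   callback (keys, budget) -> indices of cache entries to keep.
    Returns the original positions of the tokens left in the cache.
    """
    cache = []  # list of (position, key) pairs
    for position, key in enumerate(new_keys):
        cache.append((position, key))
        if len(cache) >= budget + buffer:
            keep = select([k for _, k in cache], budget)
            cache = [cache[i] for i in keep]
    return [position for position, _ in cache]


def keep_recent(keys, budget):
    """Stand-in policy (hypothetical): keep only the most recent tokens."""
    return list(range(len(keys) - budget, len(keys)))


remaining = decode_with_budget(list(range(10)), budget=4, buffer=2,
                               select=keep_recent)
print(remaining)
```

Passing the selection policy as a callback keeps the decoding loop independent of the scoring rule, so a redundancy-aware selector could be dropped in without changing the loop.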

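📊 Experimental Highlights

With only 10% of the KV cache, R-KV preserves nearly 100% of full-KV-cache performance, and with 16% it reaches 105% of full-KV-cache performance. Existing baselines, by comparison, reach only 60%. R-KV also delivers a 90% memory saving and a 6.6x throughput improvement over standard chain-of-thought inference.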

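🎯 Application Scenarios

R-KV is relevant wherever reasoning models must run with tight memory budgets and low latency, for example in natural language processing, question answering systems, and complex decision-support systems. The reported memory savings and throughput gains could make long chain-of-thought inference more practical to deploy.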

📄 Abstract (Original)

Reasoning models have demonstrated impressive performance in self-reflection and chain-of-thought reasoning. However, they often produce excessively long outputs, leading to prohibitively large key-value (KV) caches during inference. While chain-of-thought inference significantly improves performance on complex reasoning tasks, it can also lead to reasoning failures when deployed with existing KV cache compression approaches. To address this, we propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV), a novel method specifically targeting redundant tokens in reasoning models. Our method preserves nearly 100% of the full KV cache performance using only 10% of the KV cache, substantially outperforming existing KV cache baselines, which reach only 60% of the performance. Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache. This KV-cache reduction also leads to a 90% memory saving and a 6.6X throughput over standard chain-of-thought reasoning inference. Experimental results show that R-KV consistently outperforms existing KV cache compression baselines across two mathematical reasoning datasets.
