One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

Authors: Wenjun Yu, Shuguang Han, Amelie Chi Zhou

Categories: cs.DC, cs.IR, cs.LG

Published: 2026-05-06

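💡 One-Sentence Summary

Proposes HELM to resolve the contention for GPU HBM between the embedding hot cache and the KV cache in generative recommender serving.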

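🎯 Matched areas: Pillar 1: Robot Control; Pillar 2: RL Algorithms & Architecture (RL & Architecture)

Keywords: generative recommendation, memory management, dynamic scheduling, deep learning, performance optimization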

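📋 Key Points

Existing approaches do not balance HBM between the embedding hot cache (EMB) and the KV cache in generative recommender serving, which costs performance.
The paper proposes HELM, which combines adaptive memory allocation with EMB-KV-aware scheduling to adjust the memory split and request routing at runtime.
Experiments show that HELM reduces P99 latency by 24-38% and achieves 93.5-99.6% SLO satisfaction across workloads.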

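📝 Abstract (translated from Chinese)

In Generative Recommender (GR) inference, the embedding hot cache (EMB) and the KV cache compete for limited GPU HBM: giving more memory to one improves its efficiency but degrades the other. Existing systems optimize the two in isolation and overlook that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving a 20-30% latency improvement unrealized. To close this gap, the paper proposes HELM, which jointly manages HBM allocation and request routing at runtime through two components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller with a decision latency of 32 μs; and (2) EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load to avoid routing inefficiencies. In the evaluation, HELM reduces P99 latency by 24-38% on three production-scale datasets and achieves 93.5-99.6% SLO satisfaction across workloads, significantly outperforming existing baselines.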

🔬 Method Details

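Problem definition: The paper addresses how to split GPU HBM between the EMB and KV caches in generative recommender serving. Existing approaches optimize each cache in isolation and ignore that the best split changes with the workload, so they fall short of the achievable performance.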
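The trade-off in the problem definition can be illustrated with a toy model. The sketch below is not the paper's cost model: `toy_latency`, its two "pressure" parameters, and the grid search are assumptions made up for illustration. It only shows that when each cache's penalty grows as its share shrinks, the latency-minimizing EMB share moves as the workload mix changes.

```python
def toy_latency(emb_ratio, emb_pressure, kv_pressure):
    """Hypothetical latency model: each cache's penalty grows as its HBM share shrinks."""
    kv_ratio = 1.0 - emb_ratio
    return emb_pressure / emb_ratio + kv_pressure / kv_ratio


def best_ratio(emb_pressure, kv_pressure, steps=99):
    """Grid-search the EMB share of HBM in (0, 1) that minimizes the toy latency."""
    candidates = [i / (steps + 1) for i in range(1, steps + 1)]
    return min(candidates, key=lambda r: toy_latency(r, emb_pressure, kv_pressure))


# Two made-up workload regimes: one dominated by embedding lookups,
# one dominated by KV-cache reuse.
embedding_heavy = best_ratio(emb_pressure=4.0, kv_pressure=1.0)
kv_heavy = best_ratio(emb_pressure=1.0, kv_pressure=4.0)

print(f"embedding-heavy optimum: {embedding_heavy:.2f}")
print(f"kv-heavy optimum:        {kv_heavy:.2f}")
```

A static split tuned for one regime is therefore suboptimal for the other, which is the gap an online controller tries to close. The specific numbers printed here come from the toy model and say nothing about the paper's measurements.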

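Core idea: HELM manages memory allocation and request routing jointly, adjusting the EMB-KV allocation ratio at runtime to follow the workload and thereby improve end-to-end performance.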

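Technical framework: HELM consists of two main modules, Adaptive Memory Allocation and EMB-KV-Aware Scheduling. Adaptive Memory Allocation uses a three-layer PPO-based controller to make allocation decisions in real time, while the scheduling module routes requests by jointly considering KV residency, embedding locality, and node load.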
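One simple way to picture routing on the three signals named above is a weighted score per node. The sketch below is an assumption for illustration only: the `NodeState` fields, the linear scoring form, and the weights are hypothetical, and the summary does not state how HELM actually combines the signals.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    name: str
    kv_resident: bool    # is the request's KV state already cached on this node?
    emb_hit_rate: float  # estimated share of the request's embeddings in the hot cache, 0..1
    load: float          # normalized queue load, 0..1


def route_score(node, w_kv=1.0, w_emb=0.5, w_load=0.8):
    """Hypothetical linear score: reward cache affinity, penalize load."""
    return w_kv * float(node.kv_resident) + w_emb * node.emb_hit_rate - w_load * node.load


def pick_node(nodes, **weights):
    """Route to the node with the highest score."""
    return max(nodes, key=lambda n: route_score(n, **weights))


nodes = [
    NodeState("node-a", kv_resident=True, emb_hit_rate=0.4, load=0.95),
    NodeState("node-b", kv_resident=False, emb_hit_rate=0.9, load=0.10),
    NodeState("node-c", kv_resident=True, emb_hit_rate=0.7, load=0.30),
]
chosen = pick_node(nodes)
print(chosen.name)
```

With these made-up weights, the overloaded node loses despite holding the KV state, and the lightly loaded node without KV residency loses to one that has both affinity and headroom. Routing on any single signal would pick differently, which is the kind of inefficiency joint scoring is meant to avoid.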

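Key innovation: HELM couples a dynamic memory allocation policy with a request routing mechanism, both of which adapt to workload changes at runtime, improving resource utilization and response time. Compared with existing approaches, it balances the memory demands of EMB and KV more effectively.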

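Key design: Adaptive Memory Allocation uses a three-layer PPO-based controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that achieves a decision latency of 32 μs while staying within 0.024-0.029 of the offline-optimal ratio. The scheduling module routes each request according to its KV residency and embedding locality, together with node load.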
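The three layers can be pictured as a composition: a fixed base decision, a small bounded online correction, and an override for bursts. The sketch below is a guess at one possible composition, not the paper's design: the additive residual, the burst threshold, the fallback to a fixed `safe_ratio`, and all feature names are assumptions, and the base policy is a stand-in lambda rather than a trained PPO network.

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))


class LayeredRatioController:
    """Hypothetical composition of base policy, residual adapter, and burst recovery."""

    def __init__(self, base_policy, max_residual=0.05, burst_threshold=0.9, safe_ratio=0.4):
        self.base_policy = base_policy      # layer 1: frozen, never updated online
        self.residual = 0.0                 # layer 2: small correction learned online
        self.max_residual = max_residual
        self.burst_threshold = burst_threshold
        self.safe_ratio = safe_ratio        # layer 3: assumed fallback split during bursts

    def update_residual(self, delta):
        """Nudge the online correction, keeping it within a bounded range."""
        self.residual = clamp(self.residual + delta, -self.max_residual, self.max_residual)

    def decide(self, features):
        """Return the EMB share of HBM for the current workload features."""
        if features["burstiness"] >= self.burst_threshold:
            return self.safe_ratio
        ratio = self.base_policy(features) + self.residual
        return clamp(ratio, 0.05, 0.95)


# Stand-in for a trained policy: more embedding misses -> larger EMB share.
controller = LayeredRatioController(lambda f: 0.3 + 0.4 * f["emb_miss_rate"])

steady = controller.decide({"emb_miss_rate": 0.5, "burstiness": 0.1})
controller.update_residual(0.1)  # request exceeds the bound and is clipped to 0.05
adapted = controller.decide({"emb_miss_rate": 0.5, "burstiness": 0.1})
burst = controller.decide({"emb_miss_rate": 0.5, "burstiness": 0.95})
```

Bounding the residual keeps the online layer from drifting far from the frozen policy, and the decision path is a handful of arithmetic operations, which is consistent in spirit with a microsecond-scale decision budget.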

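📊 Experimental Highlights

On three production-scale datasets, HELM reduces P99 latency by 24-38% relative to the best static policy. It also achieves 93.5-99.6% SLO satisfaction across Steady, Trend, and Burst workloads, outperforming state-of-the-art baselines.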
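For readers less familiar with the two metrics, the sketch below computes them from a list of request latencies. The nearest-rank percentile definition, the 50 ms SLO, and the sample values are assumptions chosen for illustration; the summary does not say which percentile method or SLO threshold the paper uses.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def slo_satisfaction(latencies_ms, slo_ms):
    """Fraction of requests that finish within the SLO."""
    met = sum(1 for x in latencies_ms if x <= slo_ms)
    return met / len(latencies_ms)


def relative_reduction(baseline, improved):
    """Relative improvement of `improved` over `baseline`, e.g. 0.25 means 25% lower."""
    return (baseline - improved) / baseline


# Made-up samples: mostly fast requests, a slow tail, and one outlier.
samples = [10.0] * 95 + [40.0] * 4 + [200.0]

tail = p99(samples)
satisfied = slo_satisfaction(samples, slo_ms=50.0)
print(f"P99 = {tail} ms, SLO satisfaction = {satisfied:.1%}")
```

P99 is sensitive to the slow tail rather than the typical request, which is why refill traffic on the critical path can cause SLO violations even when average latency looks unchanged.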

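🎯 Application Scenarios

Potential applications include online recommender systems, personalized ad serving, and content recommendation. By optimizing memory allocation and request routing, HELM can lower serving latency in these settings, which in turn improves the user experience.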

📄 Abstract (Original)

Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimize them in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving 20-30% latency improvement unrealized. While online reallocation is required to close this gap, naive approaches introduce H2D refill traffic on the critical path, causing P99 SLO violations. To address this, we present HELM, which jointly manages HBM allocation and request routing at runtime through two key components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that achieves 32 μs decision latency while staying within 0.024-0.029 of the offline-optimal ratio; and (2) EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load to avoid routing inefficiencies under heterogeneous allocations. Evaluations on three production-scale datasets over a 32-node A100 cluster show that HELM reduces P99 latency by 24-38% over the best static policy and achieves 93.5-99.6% SLO satisfaction across Steady, Trend, and Burst workloads, significantly outperforming state-of-the-art baselines without sacrificing throughput.
