LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
Authors: Liang Luo, Yinbin Ma, Quanyu Zhu, Vasiliy Kuznetsov, Yuxin Chen, Jian Jiao, Jiecao Yu, Buyun Zhang, Tongyi Tang, Xiaohan Wei, Yanli Zhao, Zeliang Chen, Yuchen Hao, Venkatesh Ranganathan, Sandeep Parab, Yantao Yao, Maxim Naumov, Chunzhi Yang, Shen Li, Ellie Wen, Wenlin Chen, Santanu Kolay, Chunqiang Tang
Categories: cs.LG, cs.AI
Published: 2026-05-11
Comments: Accepted to ISCA'26
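💡 One-sentence takeaway
Proposes LoKA, a framework that uses system-model co-design to make FP8 training efficient for large-scale recommendation models.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: large-scale recommendation models, FP8 quantization, system-model co-design, deep learning training optimization, heterogeneous computing, numerical stability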
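📋 Key points
- LRMs are numerically sensitive and their compute is dominated by small matrix multiplications, so applying FP8 directly degrades model quality and reduces training efficiency.
- LoKA takes a system-model co-design approach: it applies FP8 selectively, guided by online statistical profiling, model-structure adaptation, and kernel dispatch.
- The approach balances model accuracy against compute performance, raising large-scale training throughput while preserving recommendation quality.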
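📝 Abstract (translated summary)
Modern GPUs deliver significantly higher floating-point throughput with low-precision arithmetic such as FP8, but adoption in large recommendation models (LRMs) remains limited. LRMs are highly sensitive to numerical precision, their compute is dominated by small matrix multiplications (GEMMs) followed by normalization, and they are trained in communication-intensive environments. Applying FP8 directly often degrades model quality and prolongs training time. To address this, the paper proposes LoKA (Low-precision Kernel Applications), a framework that tackles these challenges through system-model co-design. LoKA rests on three principles: profile under realistic distributions to identify where low precision is safe; co-design model components with hardware to expand that safe region; and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA consists of three modules, LoKA Probe (online benchmarking), LoKA Mods (model adaptations), and LoKA Dispatch (runtime dispatch), which together make FP8 practical for LRMs.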
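🔬 Method details
Problem definition: LRM training is numerically sensitive, and its compute pattern (small GEMMs followed by normalization) is a poor fit for general-purpose FP8 optimizations. As a result, quantizing directly often degrades model quality or prolongs training. LRMs are also trained in communication-intensive environments.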
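One common way to see why small GEMMs are a hard case for low precision is arithmetic intensity (FLOPs per byte moved). The sketch below is a back-of-the-envelope illustration, not an analysis taken from the paper: the helper name `arithmetic_intensity`, the square shapes, and the byte widths (2-byte or 1-byte inputs, 2-byte outputs) are all assumptions chosen for illustration.

```python
def arithmetic_intensity(m, n, k, in_bytes, out_bytes=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], moving each matrix once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved


# Illustrative "small" and "large" square shapes (assumed, not from the paper).
for size in (128, 4096):
    wide = arithmetic_intensity(size, size, size, in_bytes=2)  # 16-bit inputs
    fp8 = arithmetic_intensity(size, size, size, in_bytes=1)   # 8-bit inputs
    print(f"{size:>5}: 16-bit {wide:8.1f} FLOPs/byte, FP8 {fp8:8.1f} FLOPs/byte")
```

For square shapes of side n this ratio grows linearly with n (n/3 with 2-byte inputs and outputs, n/2 with 1-byte inputs). A kernel whose intensity is low relative to the hardware's compute-to-bandwidth ratio tends to be limited by memory traffic and launch overhead rather than peak FLOPs, so extra FP8 throughput helps it less. The sketch also ignores the cost of the quantization step itself.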
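Core idea: The paper advocates system-model co-design. Rather than converting the whole model to FP8 indiscriminately, it uses statistical analysis to identify precision-sensitive layers, then combines model-structure adjustments with dynamic kernel dispatch to maximize hardware utilization while preserving accuracy.
Technical framework: LoKA has three core components. LoKA Probe learns activation and weight statistics online and quantifies per-layer error. LoKA Mods provides reusable model adaptations that improve numerical stability and execution efficiency under FP8. LoKA Dispatch is the runtime that uses Probe's statistics to select the fastest FP8 kernel that still meets the accuracy requirement.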
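The dispatch step described above can be pictured as a small selection rule: among the kernels whose measured error fits the budget, take the fastest, and keep the high-precision kernel when no FP8 kernel is both accurate enough and faster. The following is a minimal sketch under stated assumptions, not LoKA's actual runtime: `KernelProfile`, `pick_kernel`, the kernel names, and every latency and error figure are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelProfile:
    name: str          # hypothetical kernel identifier
    latency_us: float  # latency measured for this site (made-up numbers below)
    rel_error: float   # error versus a high-precision reference


def pick_kernel(fp8_candidates, baseline, error_budget):
    """Return the fastest kernel within the error budget, baseline included."""
    eligible = [k for k in fp8_candidates if k.rel_error <= error_budget]
    # The baseline always competes, so a slow FP8 kernel is never chosen.
    return min(eligible + [baseline], key=lambda k: k.latency_us)


# Small GEMM site: both FP8 kernels are accurate but slower than the baseline.
small_choice = pick_kernel(
    [KernelProfile("fp8_lib_a", 14.0, 0.02), KernelProfile("fp8_lib_b", 12.5, 0.03)],
    KernelProfile("high_precision", 11.0, 0.0),
    error_budget=0.05,
)

# Large GEMM site: the fastest FP8 kernel is too inaccurate, the next one wins.
large_choice = pick_kernel(
    [KernelProfile("fp8_lib_a", 40.0, 0.02), KernelProfile("fp8_lib_b", 35.0, 0.20)],
    KernelProfile("high_precision", 70.0, 0.0),
    error_budget=0.05,
)

print(small_choice.name, large_choice.name)
```

Letting the baseline compete on latency is one simple way to encode that a site can be numerically safe yet still slow under FP8.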
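Key innovation: A statistics-driven, fine-grained precision control mechanism that replaces blanket quantization with on-demand quantization. Combined with model-level structural optimization, it addresses the engineering difficulty of deploying FP8 in recommendation models.
Key design: LoKA Probe uses statistical methods to estimate the quantization error that FP8 introduces, which lets it separate "safe" from "unsafe" compute sites. LoKA Mods adjusts normalization layers or activation functions to make the model more robust to low precision, so that model quality holds under FP8.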
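To make the idea of measuring per-site FP8 error concrete, here is a toy, pure-Python sketch. It is not LoKA Probe: it assumes the E4M3 format (3 mantissa bits, maximum finite value 448), a single per-tensor scale, a worst-row relative error metric, and a hypothetical budget `ERROR_BUDGET`. All function names are illustrative.

```python
import math
import random

E4M3_MAX = 448.0   # largest finite E4M3 value
MIN_EXP = -6       # exponent of the smallest normal E4M3 value
MANT_BITS = 3
ERROR_BUDGET = 0.1  # hypothetical threshold separating "safe" from "unsafe"


def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    exp = max(math.frexp(mag)[1] - 1, MIN_EXP)  # floor(log2), clamped for subnormals
    step = 2.0 ** (exp - MANT_BITS)
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)


def fake_quant(matrix):
    """Quantize and dequantize a matrix using one scale for the whole tensor."""
    amax = max(abs(v) for row in matrix for v in row)
    if amax == 0.0:
        return [row[:] for row in matrix]
    scale = E4M3_MAX / amax
    return [[round_to_e4m3(v * scale) / scale for v in row] for row in matrix]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def worst_row_error(acts, weights):
    """Largest per-row relative error of the simulated FP8 GEMM output."""
    ref = matmul(acts, weights)
    approx = matmul(fake_quant(acts), fake_quant(weights))
    worst = 0.0
    for ref_row, approx_row in zip(ref, approx):
        norm = math.sqrt(sum(r * r for r in ref_row))
        diff = math.sqrt(sum((r - p) ** 2 for r, p in zip(ref_row, approx_row)))
        if norm > 0.0:
            worst = max(worst, diff / norm)
    return worst


rng = random.Random(0)
weights = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(32)]
smooth = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(16)]
spiky = [row[:] for row in smooth]
spiky[0][0] = 1e6  # a single extreme outlier inflates the shared scale

for name, acts in (("smooth", smooth), ("spiky", spiky)):
    err = worst_row_error(acts, weights)
    print(name, f"{err:.3f}", "safe" if err <= ERROR_BUDGET else "unsafe")
```

In this toy setup the well-behaved activations stay within the budget, while a single extreme outlier pushes the shared scale so low that ordinary values round to zero. That is the kind of distribution-dependent behavior profiling on realistic data is meant to expose. A real probe would measure actual kernels on actual training tensors.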
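📊 Experimental highlights
Through system-model co-design, LoKA uses FP8 compute to raise training throughput without degrading recommendation accuracy (AUC/LogLoss). The reported experiments show that, on workloads with large-scale sparse features and complex MLP structures, it speeds up compute-intensive operators relative to conventional FP16 training and avoids numerical overflow.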
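🎯 Application scenarios
The work targets training for very large recommendation systems, such as ad click-through-rate prediction and content recommendation engines. In industry, as model parameter sizes reach the terabyte scale, LoKA can lower training cost and shorten iteration cycles, which makes it valuable for improving resource utilization on large distributed deep learning platforms.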
📄 Abstract (original)
Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.