FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

📄 arXiv: 2504.14152v1

Authors: Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

Categories: cs.AR, cs.LG

Published: 2025-04-19


💡 One-Sentence Takeaway

Proposes FGMP to address the inference-efficiency problem of large language models

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: quantization, large language models, mixed precision, inference efficiency, hardware acceleration, energy optimization, model compression

📋 Key Points

  1. Existing quantization methods often degrade model accuracy when quantizing LLM weights and activations to low precision.
  2. The paper proposes a fine-grained mixed-precision quantization method that identifies which weight and activation blocks must be kept in high precision, minimizing the perturbation to the model loss.
  3. Experiments show that FGMP quantization achieves <1% perplexity degradation on the Llama-2-7B model while reducing inference energy by 14% and weight memory requirements by 30%.

🔬 Method Details

Problem definition: The paper targets quantization for LLM inference, where existing methods often suffer significant accuracy degradation when quantizing to low precision, limiting practical deployment.

Core idea: Propose fine-grained mixed-precision (FGMP) quantization, which analyzes the perturbation introduced by quantizing each weight and activation block, weights it by the Fisher information, and selectively keeps the most sensitive blocks in high precision to preserve overall model accuracy.
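The Fisher-weighted selection idea can be sketched in a few lines of NumPy. Note the assumptions here: the block size, the toy symmetric 4-bit quantizer (standing in for NVFP4, which is really an FP4 format with microscaling), and approximating the Fisher information by the squared gradient (the empirical Fisher) are all illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def fisher_weighted_block_scores(weights, grads, block_size=16, low_bits=4):
    """Score each contiguous block by its Fisher-weighted quantization
    perturbation: score(b) = sum_i F_i * (w_i - Q(w_i))^2, with F_i
    approximated by the squared gradient (empirical Fisher)."""
    flat_w = weights.reshape(-1, block_size)
    flat_f = grads.reshape(-1, block_size) ** 2  # empirical Fisher proxy

    # Toy per-block symmetric quantizer standing in for NVFP4:
    # scale by the block max, round to 2^(low_bits-1)-1 uniform levels.
    scale = np.abs(flat_w).max(axis=1, keepdims=True) + 1e-12
    levels = 2 ** (low_bits - 1) - 1
    q = np.round(flat_w / scale * levels) / levels * scale

    perturbation = (flat_w - q) ** 2
    return (flat_f * perturbation).sum(axis=1)  # one score per block

def select_high_precision_blocks(scores, keep_frac=0.1):
    """Keep the top `keep_frac` most sensitive blocks in high precision."""
    k = max(1, int(len(scores) * keep_frac))
    return np.argsort(scores)[-k:]
```

Blocks returned by `select_high_precision_blocks` would stay in FP8, while all remaining blocks are quantized to the low-precision format.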

Technical framework: The overall FGMP architecture comprises three main components: 1) a policy for selecting which weight and activation blocks to keep in high precision; 2) a sensitivity-weighted clipping method; 3) a hardware implementation supporting FGMP quantization at block granularity.
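One plausible way to realize the sensitivity-weighted clipping component is a per-block grid search that picks the clip threshold minimizing a Fisher-weighted quantization error rather than the plain mean-squared error. The search grid, the toy 4-bit quantizer, and the candidate range are assumptions for illustration; the paper's exact search procedure may differ.

```python
import numpy as np

def sensitivity_weighted_clip(block, fisher, low_bits=4, candidates=32):
    """Choose a clipping threshold for one block by minimizing the
    Fisher-weighted quantization error (sensitive values count more)."""
    levels = 2 ** (low_bits - 1) - 1
    max_abs = np.abs(block).max() + 1e-12
    best_clip, best_err = max_abs, np.inf
    for frac in np.linspace(0.3, 1.0, candidates):
        clip = frac * max_abs
        clipped = np.clip(block, -clip, clip)
        # Quantize with the candidate clip as the scale.
        q = np.round(clipped / clip * levels) / levels * clip
        err = (fisher * (block - q) ** 2).sum()  # sensitivity-weighted
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip
```

Weighting the error by the Fisher term means a clip value that sacrifices a few insensitive outliers can win if it reduces rounding error on the values that matter most to the loss.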

Key innovation: The central contribution is a Fisher-information-based perturbation selection policy that effectively identifies which weight and activation blocks need to remain in high precision, minimizing the impact of quantization on the model loss.

Key design: The design uses NVFP4 as the low-precision datatype and FP8 as the high-precision datatype, balancing performance and energy consumption during quantization while assigning activation block precision dynamically at runtime.
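Since activations are only known at runtime, the paper's mixed-precision activation quantization unit assigns blocks to high or low precision on the fly in hardware. A functional software model of that assignment might look like the following; the block size, the fixed threshold, and the toy 4-bit quantizer used as the error proxy are all hypothetical stand-ins.

```python
import numpy as np

def assign_activation_blocks(acts, fisher, block_size=16, threshold=1.0):
    """Functional model of on-the-fly precision assignment: blocks whose
    Fisher-weighted quantization error exceeds `threshold` go to FP8,
    the rest to the low-precision NVFP4 path."""
    a = acts.reshape(-1, block_size)
    f = fisher.reshape(-1, block_size)

    # Toy 4-bit symmetric quantizer as an error proxy for the low path.
    scale = np.abs(a).max(axis=1, keepdims=True) + 1e-12
    levels = 7  # 2^(4-1) - 1
    q = np.round(a / scale * levels) / levels * scale

    score = (f * (a - q) ** 2).sum(axis=1)
    return np.where(score > threshold, "FP8", "NVFP4")
```

In the actual design this decision is made by dedicated hardware with minimal runtime and energy overhead; the sketch above only illustrates the block-granular decision logic.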

📊 Experimental Highlights

Experiments show that FGMP quantization achieves <1% perplexity degradation on the Llama-2-7B model while, relative to an all-FP8 baseline design, reducing inference energy by 14% and weight memory requirements by 30%, a substantial efficiency gain.

🎯 Application Scenarios

Potential application areas include natural language processing, machine translation, dialogue systems, and other LLM workloads that demand efficient inference. By improving inference efficiency and reducing energy consumption, FGMP quantization could broaden LLM deployment to mobile devices and edge-computing environments, advancing intelligent applications.

📄 Abstract (original)

Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations to low precision is challenging without degrading model accuracy. We propose fine-grained mixed precision (FGMP) quantization, a post-training mixed-precision quantization hardware-software co-design methodology that maintains accuracy while quantizing the majority of weights and activations to reduced precision. Our work makes the following contributions: 1) We develop a policy that uses the perturbation in each value, weighted by the Fisher information, to select which weight and activation blocks to keep in higher precision. This approach preserves accuracy by identifying which weight and activation blocks need to be retained in higher precision to minimize the perturbation in the model loss. 2) We also propose a sensitivity-weighted clipping approach for fine-grained quantization which helps retain accuracy for blocks that are quantized to low precision. 3) We then propose hardware augmentations to leverage the efficiency benefits of FGMP quantization. Our hardware implementation encompasses i) datapath support for FGMP at block granularity, and ii) a mixed-precision activation quantization unit to assign activation blocks to high or low precision on the fly with minimal runtime and energy overhead. Our design, prototyped using NVFP4 (an FP4 format with microscaling) as the low-precision datatype and FP8 as the high-precision datatype, facilitates efficient FGMP quantization, attaining <1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design while consuming 14% less energy during inference and requiring 30% less weight memory.