ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks

📄 arXiv: 2312.08583v2

Authors: Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao

Categories: cs.CL, stat.ML

Published: 2023-12-14 (updated: 2023-12-18)


💡 One-Sentence Takeaway

Proposes an FP6-centric strategy to improve quantization of LLMs

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: quantization, large language models, FP6, generative tasks, code generation, abstractive summarization, AI hardware

📋 Key Points

  1. Existing 4-bit quantization methods such as GPTQ can overfit in large language models and bring only limited gains on zero-shot tasks.
  2. The paper proposes a new FP6-centric quantization strategy whose 4+2 design (see the bit-layout sketch after this list) is meant to run efficiently on diverse AI hardware.
  3. Experiments show that FP6 excels on code generation and summarization tasks; notably, a 406M model comes close to its baseline, surpassing INT4's performance.
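
The 4+2 design splits each 6-bit weight into a 4-bit plane and a 2-bit plane, so both planes have power-of-two widths and stay memory-aligned, unlike a raw 6-bit stride. Below is a minimal sketch of the idea on unpacked integer codes; the function names are mine, and the real implementation packs the planes into contiguous GPU buffers inside fused dequantization kernels rather than handling codes one tensor op at a time.

```python
import torch

def split_4_plus_2(codes: torch.Tensor):
    """Split 6-bit weight codes (values 0..63) into a 4-bit plane and a
    2-bit plane. Power-of-two plane widths keep memory accesses aligned,
    which a raw 6-bit stride cannot."""
    high4 = (codes >> 2) & 0xF   # upper 4 bits of each 6-bit code
    low2 = codes & 0x3           # lower 2 bits of each 6-bit code
    return high4, low2

def merge_4_plus_2(high4: torch.Tensor, low2: torch.Tensor) -> torch.Tensor:
    """Reassemble the original 6-bit codes at dequantization time."""
    return (high4 << 2) | low2

codes = torch.randint(0, 64, (16,), dtype=torch.uint8)
high4, low2 = split_4_plus_2(codes)
assert torch.equal(merge_4_plus_2(high4, low2), codes)
```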

📝 Abstract (Translated)

This study examines the limitations of 4-bit quantization methods such as GPTQ in large language models (LLMs), pointing out their overfitting and limited gains on zero-shot tasks. We broaden the task scope to generative categories such as code generation and abstractive summarization, and find that INT4 quantization underperforms there. Although the FP6 format has been overlooked on current AI hardware due to the lack of sophisticated integration and system acceleration strategies, our results show that FP6 performs strongly across a range of algorithms and tasks, beating INT4 in both accuracy and versatility. We propose a novel 4+2 design that brings FP6 latency on par with state-of-the-art fine-grained INT4 quantization, making FP6 a promising alternative among 4-bit quantization methods for current LLMs.

🔬 Method Details

Problem definition: This work addresses the performance shortfalls of existing 4-bit quantization methods for large language models, in particular GPTQ's overfitting and limited gains on zero-shot tasks.

Core idea: The paper proposes a new FP6-centric quantization strategy that uses a 4+2 design to optimize quantization performance, overcoming the integration and acceleration hurdles that higher-precision formats face on current AI hardware.

Technical framework: The overall design covers the specification and implementation of FP6 quantization, validated with adaptation tests across multiple algorithms and tasks to ensure efficient execution on different hardware.
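
To make the format concrete: FP6 here is a 6-bit float with 1 sign, 3 exponent, and 2 mantissa bits (E3M2). The sketch below simulates FP6 weight quantization in PyTorch by snapping scaled weights onto the grid of representable values; the exponent bias, the treatment of all exponent codes as finite values (common for sub-8-bit floats), and the per-tensor scale are assumptions made for illustration, not details fixed by the paper's abstract.

```python
import torch

def fp6_e3m2_grid(exp_bias: int = 3) -> torch.Tensor:
    """All non-negative values representable in FP6 with 1 sign, 3 exponent,
    and 2 mantissa bits (E3M2). The exponent bias is an assumption for this
    sketch, and every exponent code is treated as a finite value."""
    vals = {0.0}
    for e in range(8):            # 3-bit exponent field
        for m in range(4):        # 2-bit mantissa field
            if e == 0:            # subnormals: no implicit leading 1
                vals.add((m / 4.0) * 2.0 ** (1 - exp_bias))
            else:                 # normals: implicit leading 1
                vals.add((1.0 + m / 4.0) * 2.0 ** (e - exp_bias))
    return torch.tensor(sorted(vals))

def fake_quant_fp6(w: torch.Tensor, exp_bias: int = 3) -> torch.Tensor:
    """Simulated FP6 quantization: scale into FP6 range, snap each weight to
    the nearest representable value, then scale back. The single per-tensor
    scale reflects the coarse-grained scheme the paper reports for FP6."""
    grid = fp6_e3m2_grid(exp_bias)
    scale = w.abs().max().clamp(min=1e-12) / grid.max()
    x = (w / scale).abs().flatten().to(grid.dtype)
    idx = torch.searchsorted(grid, x).clamp(max=len(grid) - 1)
    lo, hi = grid[(idx - 1).clamp(min=0)], grid[idx]
    snapped = torch.where(x - lo <= hi - x, lo, hi)
    return (snapped.view_as(w) * w.sign() * scale).to(w.dtype)
```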

Key innovation: The central contribution is FP6 quantization together with the 4+2 design, which brings its latency on par with fine-grained INT4 quantization while improving accuracy and versatility.

Key design: FP6 adopts a coarse-grained quantization scheme, paired with specific parameter settings and loss functions, to ensure stability and efficiency across a variety of generative tasks.
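
"Coarse-grained" here means very few scaling factors, e.g. one per tensor, whereas state-of-the-art INT4 methods typically need one scale per small group of weights to stay accurate. A hypothetical helper makes the bookkeeping difference visible (the function and the group size of 128 are illustrative choices, not values taken from the paper):

```python
import torch
from typing import Optional

def quant_scales(w: torch.Tensor, group_size: Optional[int] = None) -> torch.Tensor:
    """Contrast coarse- and fine-grained quantization bookkeeping.

    group_size=None -> one scale for the whole tensor (coarse-grained,
                       which the paper finds sufficient for FP6);
    group_size=128  -> one scale per 128 weights (fine-grained, typical
                       of state-of-the-art INT4 schemes).
    """
    if group_size is None:
        return w.abs().max().reshape(1)
    return w.reshape(-1, group_size).abs().amax(dim=1)

w = torch.randn(4096, 4096)
print(quant_scales(w).shape)                  # torch.Size([1])
print(quant_scales(w, group_size=128).shape)  # torch.Size([131072])
```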

📊 Experimental Highlights

With FP6 quantization, the StarCoder-15B model performs on par with its FP16 counterpart on code generation, and a 406M model closely matches its baseline on summarization; both results clearly exceed what INT4 quantization achieves, demonstrating FP6's advantage.
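
As a toy illustration of this kind of comparison (not the paper's benchmark protocol, which evaluates full models on code-generation and summarization suites), one can fake-quantize a weight matrix with the `fake_quant_fp6` sketch above and measure how far a layer's output drifts:

```python
import torch

torch.manual_seed(0)
w = torch.randn(1024, 1024) * 0.02   # toy weight matrix
x = torch.randn(8, 1024)             # toy input activations

w_fp6 = fake_quant_fp6(w)            # sketch defined above
ref, out = x @ w.T, x @ w_fp6.T
rel_err = (ref - out).abs().mean() / ref.abs().mean()
print(f"mean relative output error under simulated FP6: {rel_err:.4f}")
```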

🎯 Application Scenarios

Potential application areas include generative tasks such as natural language processing, code generation, and text summarization. The FP6 quantization strategy offers a new route to efficient model deployment across diverse AI hardware, with significant practical value and potential future impact.

📄 Abstract (Original)

This study examines 4-bit quantization methods like GPTQ in large language models (LLMs), highlighting GPTQ's overfitting and limited enhancement in zero-shot tasks. While prior works merely focus on zero-shot measurement, we extend the task scope to more generative categories such as code generation and abstractive summarization, in which we find that INT4 quantization can significantly underperform. However, simply shifting to higher-precision formats like FP6 has been particularly challenging, and thus overlooked, due to poor performance caused by the lack of sophisticated integration and system acceleration strategies on current AI hardware. Our results show that FP6, even with a coarse-grain quantization scheme, performs robustly across various algorithms and tasks, demonstrating its superiority in accuracy and versatility. Notably, with FP6 quantization, the StarCoder-15B model performs comparably to its FP16 counterpart in code generation, and smaller models like the 406M closely match their baselines in summarization. Neither can be achieved by INT4. To better accommodate various AI hardware and achieve the best system performance, we propose a novel 4+2 design for FP6 to achieve similar latency to the state-of-the-art INT4 fine-grain quantization. With our design, FP6 can become a promising solution to the current 4-bit quantization methods used in LLMs.