ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

Authors: Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang

Categories: cs.LG, cs.AI, cs.CL

Published: 2025-09-11 (updated: 2025-09-25)

Comments: Replace discrete Hadamard transforms with continuous butterfly transforms to facilitate the learning of rotation matrices in LLM quantization

🔗 Code/Project: GitHub (https://github.com/42Shawn/Butterflyquant-llm)

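💡 One-Sentence Takeaway

Proposes ButterflyQuant, which replaces fixed Hadamard rotations with learnable orthogonal butterfly transforms for ultra-low-bit LLM quantization.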

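🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: quantization, deep learning, natural language processing, model compression, butterfly transform, activation optimization, layer adaptivity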

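📋 Key Points

Existing quantization methods degrade severely under extreme 2-bit quantization because of outliers in the activations.
This paper proposes ButterflyQuant, which replaces the fixed Hadamard transform with learnable butterfly transforms, giving layer-adaptive rotations that improve quantization quality.
On LLaMA-2-7B with 2-bit quantization, ButterflyQuant reaches a perplexity of 15.4, compared with 37.3 for QuIP.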

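📝 Abstract (translated from Chinese)

Large language models have enormous memory footprints, which limits their deployment on consumer hardware. Quantization reduces memory by lowering numerical precision, but extreme 2-bit quantization suffers severe performance loss because of outliers in the activations. Existing rotation-based methods such as QuIP and QuaRot use fixed Hadamard transforms to suppress outliers and cannot adapt to specific weight distributions. This paper proposes ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms that adapt to each transformer layer. With 2-bit quantization of LLaMA-2-7B, ButterflyQuant achieves a perplexity of 15.4, versus 37.3 for QuIP.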

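🔬 Method Details

Problem definition: The paper targets the performance loss that large language models suffer under extreme 2-bit quantization because of activation outliers. Existing methods such as QuIP and QuaRot use a fixed Hadamard transform, which cannot adapt to the weight distributions of different layers and therefore limits quantization quality.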
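
The two ingredients of this problem statement can be illustrated with a toy sketch in pure Python: the computational invariance $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$, and the way a single outlier inflates the step size of a 2-bit quantizer. The helper names (`fwht`, `dot`, `step_size`), the vector length, and the max-scaled uniform quantizer are assumptions made for illustration; they are not taken from the paper's code.

```python
import math
import random

def fwht(v):
    """Normalized fast Walsh-Hadamard transform (an orthogonal, symmetric rotation).
    len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def step_size(v, bits=2):
    """Step of a uniform quantizer whose 2**bits levels span [-max|v|, +max|v|]."""
    return 2 * max(abs(x) for x in v) / (2 ** bits - 1)

random.seed(0)
n = 64
x = [random.gauss(0.0, 1.0) for _ in range(n)]   # toy activation vector
x[3] = 50.0                                      # one activation outlier
w = [random.gauss(0.0, 1.0) for _ in range(n)]   # one row of a weight matrix W

# Invariance: H is symmetric and orthogonal, so w H^T = fwht(w)
# and (w H^T)(H x) equals w x up to floating-point rounding.
y_plain = dot(w, x)
y_rotated = dot(fwht(w), fwht(x))
print(abs(y_plain - y_rotated))

# The rotation spreads the outlier over all coordinates, which shrinks the
# largest magnitude and therefore the quantization step.
print(step_size(x), step_size(fwht(x)))
```

The rotation here is fixed: the same Hadamard matrix is applied whatever the outlier pattern of the layer looks like, which is the limitation the paper sets out to remove.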

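Core idea: ButterflyQuant replaces the fixed Hadamard transform with learnable butterfly transforms, so that layer-adaptive rotations suppress outliers and improve quantization performance. The continuous parameterization of butterfly transforms makes optimization smooth and guarantees orthogonality by construction.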
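
A minimal sketch of how a butterfly transform built from Givens rotations stays orthogonal for any choice of angles. The pairing scheme (stride doubling per stage, as in an FFT), the function names, and the random angles are assumptions for illustration; the authors' implementation may order or index the stages differently.

```python
import math
import random

def butterfly_apply(v, angles):
    """Apply log2(n) butterfly stages to v.
    angles[s][k] is the Givens angle of the k-th coordinate pair in stage s."""
    v = list(v)
    n = len(v)
    h = 1
    for stage in angles:
        k = 0
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                c, s = math.cos(stage[k]), math.sin(stage[k])
                a, b = v[j], v[j + h]
                v[j], v[j + h] = c * a - s * b, s * a + c * b  # 2x2 Givens rotation
                k += 1
        h *= 2
    return v

def random_angles(n, rng):
    """One angle per coordinate pair: n/2 angles in each of log2(n) stages."""
    stages = n.bit_length() - 1          # log2(n) for n a power of two
    return [[rng.uniform(-math.pi, math.pi) for _ in range(n // 2)]
            for _ in range(stages)]

n = 8
rng = random.Random(0)
angles = random_angles(n, rng)
v = [rng.gauss(0.0, 1.0) for _ in range(n)]
out = butterfly_apply(v, angles)

print(sum(len(stage) for stage in angles))            # 12 angles for n = 8
print(sum(a * a for a in v), sum(a * a for a in out))  # squared norms agree
```

Because every stage is a set of independent 2x2 rotations, the product is orthogonal whatever the angles are, so no projection or re-orthogonalization step is needed during training. Setting every angle to $\pi/4$ gives a matrix whose entries all have magnitude $1/\sqrt{n}$, that is, a Hadamard-like transform up to signs.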

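Technical framework: The overall architecture consists of data preprocessing, a learnable butterfly transform module, and a quantization module. Activations are transformed first and then converted to a low-bit representation by the quantization module. According to the abstract, learning the transforms requires only 128 calibration samples and converges in minutes on a single GPU.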

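Key innovation: The main technical contribution is the learnable butterfly transform, which lets the model adapt the rotation to the characteristics of each layer. Hadamard matrices have discrete {+1, -1} entries that are non-differentiable, whereas the continuous parameterization of butterfly transforms allows gradient-based optimization.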
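
To make the role of differentiability concrete, the sketch below learns a single Givens angle by plain gradient descent. The objective (the sum of fourth powers of the rotated pair, a simple proxy that penalizes outliers at fixed norm) is an assumption chosen for illustration and is not the paper's loss; the function names, learning rate, and step count are likewise hypothetical.

```python
import math

def rotate(a, b, theta):
    """Givens rotation of the pair (a, b) by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return c * a - s * b, s * a + c * b

def loss_and_grad(a, b, theta):
    """Sum of fourth powers of the rotated pair and its exact derivative in theta."""
    x, y = rotate(a, b, theta)
    loss = x ** 4 + y ** 4
    grad = 4 * x * y * (y * y - x * x)   # uses dx/dtheta = -y and dy/dtheta = x
    return loss, grad

def learn_angle(a, b, theta=0.1, lr=0.1, steps=200):
    """Plain gradient descent on the rotation angle."""
    for _ in range(steps):
        _, grad = loss_and_grad(a, b, theta)
        theta -= lr * grad
    return theta

# The pair (1, 0) is as "outlier-like" as a 2-vector can be.
theta = learn_angle(1.0, 0.0)
# theta approaches pi/4, where the loss reaches its minimum of 0.5
print(theta, loss_and_grad(1.0, 0.0, theta)[0])
```

For the input (1, 0) the learned angle is the Hadamard-like $\pi/4$, which maps the pair to two equal entries. For the already flat input (1, 1), however, a fixed $\pi/4$ rotation produces $(0, \sqrt{2})$ and so creates an outlier, while the best angle under this proxy is 0. A learnable angle can follow the data in both cases; a fixed $\pm 1$ matrix cannot.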

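Key design: The butterfly transform is parameterized by continuous Givens rotation angles, requiring only $\frac{n \log n}{2}$ learnable parameters at $O(n \log n)$ computational cost. A uniformity regularization term on the post-transformation activations is added to the loss to promote smoother distributions that are more amenable to quantization.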
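
The parameter count follows directly from the structure of the transform. The derivation below assumes that $n$ is a power of two and that $\log$ means $\log_2$; the stage matrices $\mathbf{B}_s$ are notation introduced here, and $n = 4096$ is only an illustrative size.

```latex
% One Givens rotation is orthogonal for every angle:
G(\theta) =
\begin{pmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{pmatrix},
\qquad
G(\theta)^{T} G(\theta) = \mathbf{I}_2 .

% A butterfly transform is a product of log2(n) stages, each made of
% n/2 independent Givens rotations acting on disjoint coordinate pairs:
\mathbf{Q}(\boldsymbol{\theta}) = \mathbf{B}_{\log_2 n} \cdots \mathbf{B}_2 \mathbf{B}_1,
\qquad
\#\text{angles} = \frac{n}{2} \cdot \log_2 n = \frac{n \log_2 n}{2} .

% Example: n = 4096
\frac{4096 \cdot 12}{2} = 24{,}576 \ \text{angles}
\quad \text{versus} \quad
n^2 = 16{,}777{,}216 \ \text{entries of a dense } n \times n \text{ matrix}.
```

Each stage is orthogonal because its rotations act on disjoint pairs, and a product of orthogonal matrices is orthogonal, which is why the abstract can claim orthogonality by construction.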

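📊 Experimental Highlights

On LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves a perplexity of 15.4, compared with 37.3 for QuIP (lower is better), demonstrating its effectiveness for ultra-low-bit quantization.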

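🎯 Application Scenarios

ButterflyQuant is most relevant to resource-constrained settings such as mobile devices and edge computing. By reducing the memory footprint of large language models, it allows them to be deployed on a wider range of hardware, which could broaden the adoption of natural language processing applications.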

📄 Abstract (Original)

Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms (Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$) that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and thus prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU, a negligible one-time cost. For LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 37.3 for QuIP. Code is available at https://github.com/42Shawn/Butterflyquant-llm.
