PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models

Authors: He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zhen Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong

Categories: cs.LG, cs.AI

Published: 2025-09-21 (updated: 2026-01-01)

Comments: Ternary Quantization, Under review

🔗 Code/Project: https://github.com/HeXiao-55/PTQTP

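💡 One-Sentence Summary

Proposes PTQTP, a post-training quantization method that maps large language model weights to ternary trit-planes for efficient inference.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: post-training quantization, large language models, ternary quantization, low-bit quantization, model compression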

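📋 Key Points

Existing ultra-low-bit quantization methods rely on binary approximations or quantization-aware training, and so suffer from either limited representational capacity or heavy training resource overhead.
PTQTP decomposes each weight matrix into dual ternary trit-planes, decoupling weights into discrete topology and continuous magnitude to enable multiplication-free additive inference.
Experiments show that PTQTP significantly outperforms sub-4bit PTQ methods across multiple LLMs, runs end-to-end inference 4.63× faster than the FP16 baseline, and needs only about an hour of quantization.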

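📝 Abstract (Translated from Chinese)

This paper proposes PTQTP (PTQ to Trit-Planes), a structured post-training quantization (PTQ) framework for quantizing large language models (LLMs) to extremely low bit-widths. The method decomposes weight matrices into dual ternary {-1, 0, 1} trit-planes. By decoupling weights into discrete topology (trit-planes) and continuous magnitude (scales), it achieves multiplication-free additive inference and a high-fidelity sparse approximation. PTQTP provides: (1) a theoretically grounded progressive approximation algorithm that ensures global weight consistency; (2) model-agnostic deployment with no architectural modifications; and (3) uniform ternary operations that eliminate mixed-precision overhead. Comprehensive experiments on LLaMA3.x and Qwen3 (0.6B-70B) show that PTQTP significantly outperforms sub-4bit PTQ methods on language reasoning, mathematical reasoning, and coding tasks. PTQTP rivals 1.58-bit QAT while requiring only a single hour of quantization, compared with 10-14 GPU days for training-based methods, and its end-to-end inference is 4.63× faster than the FP16 baseline, making it a practical option for efficient LLM deployment in resource-constrained environments.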

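🔬 Method Details

Problem definition: The paper addresses the fundamental tension between computational efficiency and representational capacity when large language models (LLMs) are post-training quantized (PTQ) to extremely low bit-widths. Existing approaches, such as binary approximation or quantization-aware training (QAT), either lack representational capacity or demand very large training resources, which limits their use in resource-constrained environments.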

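Core idea: PTQTP decomposes each weight matrix into dual ternary {-1, 0, 1} trit-planes. The decomposition decouples the weights into a discrete topology (the trit-planes) and a continuous magnitude (the scales). Because every trit is -1, 0, or 1, inference can replace multiplications with additions, which lowers computational cost substantially while retaining high representational capacity.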
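To make the additive-inference idea concrete, here is a minimal NumPy sketch, not the authors' kernel. It assumes one scale per output row and per plane (the names `a1`, `a2`, `T1`, `T2` and the helper functions are hypothetical), so that W is approximated by a1 * T1 + a2 * T2. The accumulation inside each plane uses only additions and subtractions; in this sketch a single multiply per row and per plane remains for the scales.

```python
import numpy as np

def ternary_matvec(T, x):
    """Compute T @ x for T with entries in {-1, 0, 1} using only adds/subtracts."""
    out = np.zeros(T.shape[0], dtype=x.dtype)
    for i in range(T.shape[0]):
        # Add inputs where the trit is +1, subtract where it is -1, skip zeros.
        out[i] = x[T[i] == 1].sum() - x[T[i] == -1].sum()
    return out

def trit_plane_matvec(T1, T2, a1, a2, x):
    """Approximate W @ x where W ~ a1[:, None] * T1 + a2[:, None] * T2.

    Per-row scales are an assumption made for this illustration.
    """
    return a1 * ternary_matvec(T1, x) + a2 * ternary_matvec(T2, x)

# Small random example with hypothetical shapes.
rng = np.random.default_rng(0)
T1 = rng.integers(-1, 2, size=(4, 8))   # trits in {-1, 0, 1}
T2 = rng.integers(-1, 2, size=(4, 8))
a1 = rng.random(4)
a2 = rng.random(4)
x = rng.standard_normal(8)

W_hat = a1[:, None] * T1 + a2[:, None] * T2   # dense reconstruction, for checking only
y = trit_plane_matvec(T1, T2, a1, a2, x)
```

The dense `W_hat` is built only to check the result against an ordinary matrix-vector product; an actual deployment would keep the planes packed and never materialise it.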

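Technical framework: PTQTP consists of three main steps. (1) Weight matrix decomposition: the original weight matrix is decomposed into ternary trit-planes and their corresponding scale factors. (2) Progressive approximation: a theoretically grounded progressive approximation algorithm optimizes the trit-planes and scale factors while ensuring global weight consistency. (3) Inference acceleration: inference runs on ternary operations, with no multiplications. The framework can be deployed directly, without modifying the model architecture.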
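The summary above does not spell out the progressive approximation algorithm, so the following is only a generic alternating scheme under stated assumptions, not the paper's procedure. For one weight row it alternates between a discrete step (pick the best of the 9 trit pairs for each weight) and a continuous step (ridge-regularised least squares for the two scales). The function name `fit_row`, the initialisation, and the `iters` and `ridge` parameters are all hypothetical.

```python
import itertools
import numpy as np

# All 9 possible (t1, t2) trit pairs, shape (9, 2).
PAIRS = np.array(list(itertools.product((-1, 0, 1), repeat=2)))

def fit_row(w, iters=20, ridge=1e-8):
    """Alternately fit two trit-planes and two scales to one weight row (illustrative)."""
    # Assumed initialisation: scales derived from the mean absolute weight.
    a = np.array([np.abs(w).mean(), 0.5 * np.abs(w).mean()])
    for _ in range(iters):
        # Discrete step: for each weight, choose the trit pair whose level is closest.
        levels = PAIRS @ a                                   # shape (9,)
        idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
        t = PAIRS[idx].astype(w.dtype)                       # shape (n, 2)
        # Continuous step: least squares for the scales, with a tiny ridge term
        # so the 2x2 system stays solvable if a plane is all zeros.
        gram = t.T @ t + ridge * np.eye(2)
        a = np.linalg.solve(gram, t.T @ w)
    return t[:, 0], t[:, 1], a

rng = np.random.default_rng(0)
w = rng.standard_normal(256)
t1, t2, a = fit_row(w)
w_hat = a[0] * t1 + a[1] * t2
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
```

Each step minimises the reconstruction error with the other variable held fixed, so apart from the negligible ridge term the error does not increase from one iteration to the next. How the paper enforces global weight consistency across rows or groups is not covered by this sketch.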

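Key innovation: PTQTP's main contribution is its structured quantization scheme, which decomposes weights into trit-planes. Compared with conventional low-bit quantization, decoupling the weights in this way yields a high-fidelity sparse approximation and preserves model quality at extremely low bit-widths. PTQTP also uses uniform ternary operations, which removes the extra overhead of mixed precision.

Key design: The main design elements are: (1) Dual ternary trit-planes: two ternary planes represent each weight, which increases representational capacity. (2) Progressive approximation algorithm: the algorithm iteratively optimizes the trit-planes and scale factors to minimize the error between the quantized and original weights. (3) Model agnosticism: PTQTP can be applied to a wide range of LLMs without model-specific adjustments.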
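As a rough illustration of why a second plane adds capacity, assume a weight is reconstructed from two trits and two scales (the symbols $\alpha_1$, $\alpha_2$, $t_1$, $t_2$ are notation chosen here, not taken from the paper):

```latex
\hat{w} = \alpha_1 t_1 + \alpha_2 t_2, \qquad t_1, t_2 \in \{-1, 0, 1\}

\hat{w} \in \bigl\{\, 0,\ \pm\alpha_1,\ \pm\alpha_2,\ \pm(\alpha_1 + \alpha_2),\ \pm(\alpha_1 - \alpha_2) \,\bigr\}
```

A single plane offers only the three levels $\{-\alpha_1, 0, \alpha_1\}$, whereas two planes offer up to nine distinct levels when the scales do not make any of them coincide. Since one trit carries $\log_2 3 \approx 1.585$ bits, two planes carry at most $2 \log_2 3 \approx 3.17$ bits per weight before any overhead for storing the scales; the actual storage layout is not described in this summary.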

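📊 Experimental Highlights

On LLaMA3.x and Qwen3 (0.6B-70B), PTQTP significantly outperforms sub-4bit PTQ methods on language reasoning, mathematical reasoning, and coding tasks. Its quality rivals 1.58-bit QAT while quantization takes only a single hour, whereas QAT needs 10-14 GPU days. End-to-end inference is 4.63× faster than the FP16 baseline.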

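🎯 Application Scenarios

PTQTP targets LLM deployment in resource-constrained environments such as mobile and edge devices. It substantially reduces a model's compute and storage requirements, so LLMs can run efficiently on such hardware. It also suits latency-sensitive applications such as real-time dialogue systems and intelligent assistants.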

📄 Abstract (Original)

Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and representational capacity. While existing ultra-low-bit methods rely on binary approximations or quantization-aware training (QAT), they often suffer from either limited representational capacity or huge training resource overhead. We introduce PTQ to Trit-Planes (PTQTP), a structured PTQ framework that decomposes weight matrices into dual ternary {-1, 0, 1} trit-planes. This approach achieves multiplication-free additive inference by decoupling weights into discrete topology (trit-planes) and continuous magnitude (scales), effectively enabling high-fidelity sparse approximation. PTQTP provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment without architectural modifications; and (3) uniform ternary operations that eliminate mixed-precision overhead. Comprehensive experiments on LLaMA3.x and Qwen3 (0.6B-70B) demonstrate that PTQTP significantly outperforms sub-4bit PTQ methods on language reasoning, mathematical reasoning, and coding tasks. PTQTP rivals the 1.58-bit QAT performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods, and its end-to-end inference is 4.63$\times$ faster than the FP16 baseline model, establishing a new and practical solution for efficient LLM deployment in resource-constrained environments. Code will be available at https://github.com/HeXiao-55/PTQTP.
