Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction

Author: David Alejandro Trejo Pizzo

Categories: cs.LG, cs.AI, cs.CL

Published: 2026-02-05

Comments: 21 pages, 4 figures, 6 tables. Code and models will be released at opencores.ai

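💡 One-Sentence Summary

Proposes Hybrid Gated Flow (HGF), which stabilizes 1.58-bit large language models through selective low-rank correction.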

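🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, quantization, low-rank correction, edge computing, mixed precision, model compression, adaptive gating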

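📋 Key Points

Deploying LLMs on edge devices is limited by memory bandwidth; 1.58-bit quantization shrinks the memory footprint but incurs a significant loss in quality.
HGF uses a dual-stream architecture that combines a 1.58-bit backbone with a low-rank FP16 correction path, adjusted dynamically by a gating mechanism.
Experiments show that HGF recovers much of the lost quality, narrowing the gap to the FP16 baseline, and that it trains stably.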

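📝 Abstract (translated from Chinese)

This paper proposes Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates. The method targets the "Memory Wall" that constrains large language model (LLM) deployment on edge devices, where memory bandwidth rather than compute becomes the bottleneck. On the TinyStories dataset, across two training regimes (2500 and 3500 steps), HGF 5.4 reaches a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490), with only about 12-15% memory overhead beyond the ternary backbone. The experiments also show an emergent phenomenon: quantization acting as structural regularization. Preliminary results on 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu suggest that the architecture scales.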

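🔬 Method Details

Problem definition: The paper addresses the performance bottleneck caused by limited memory bandwidth (the "Memory Wall") when deploying large language models on edge devices. Existing 1.58-bit quantization techniques dramatically reduce the memory footprint, but they typically worsen perplexity by 20-25% relative to FP16 baselines, which is a substantial loss in quality.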
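As a rough illustration of the memory argument, the snippet below shows where the "1.58-bit" figure comes from (a weight with three possible values carries log2(3) bits) and compares idealized weight storage for a 3B-parameter model in FP16 and in ternary form. The helper names are illustrative, not from the paper, and the numbers assume perfect packing with no scales, activations, or KV cache; practical packing formats generally spend somewhat more per weight.

```python
import math

def bits_per_symbol(num_levels: int) -> float:
    """Information content of one weight that can take `num_levels` values."""
    return math.log2(num_levels)

def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Idealized weight storage in GiB (ignores scales and packing overhead)."""
    return num_params * bits_per_weight / 8 / 2**30

ternary_bits = bits_per_symbol(3)          # weights in {-1, 0, +1}
fp16_gib = weight_memory_gib(3e9, 16)      # 3B parameters in FP16
ternary_gib = weight_memory_gib(3e9, ternary_bits)

print(f"bits per ternary weight: {ternary_bits:.3f}")
print(f"FP16 weights:    {fp16_gib:.2f} GiB")
print(f"ternary weights: {ternary_gib:.2f} GiB")
```

Because a bandwidth-bound decoder has to stream the weights for every generated token, a roughly tenfold reduction in weight bytes matters more on edge hardware than the matching reduction in arithmetic.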

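Core idea: A low-precision ternary (1.58-bit) backbone preserves memory efficiency, while a learnable, low-rank FP16 correction path compensates for the quality lost to quantization. An adaptive gating mechanism dynamically controls how strongly the correction path is activated, trading off memory footprint against quality.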
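To get a feel for why a low-rank FP16 path can stay cheap, here is a back-of-the-envelope sketch of its storage overhead relative to the ternary weights of a single linear layer. Everything in it is an assumption made for illustration: the 4096-wide square layer, the candidate ranks, and the accounting itself (it ignores gates, scales, embeddings, and any layers without a correction path). The paper's actual ranks and layer shapes are not stated in this summary.

```python
import math

TERNARY_BITS = math.log2(3)  # idealized bits per ternary weight
FP16_BITS = 16

def low_rank_overhead(d_in: int, d_out: int, rank: int) -> float:
    """Extra storage of a rank-`rank` FP16 factor pair (d_in x rank, rank x d_out)
    as a fraction of the ternary d_in x d_out weight matrix."""
    backbone_bits = d_in * d_out * TERNARY_BITS
    correction_bits = rank * (d_in + d_out) * FP16_BITS
    return correction_bits / backbone_bits

# Hypothetical layer width and ranks, chosen only for illustration.
for rank in (8, 16, 24, 32):
    print(rank, f"{low_rank_overhead(4096, 4096, rank):.1%}")
```

Under these assumptions, ranks of roughly 24 to 30 on a 4096-wide layer land near the 12-15% overhead reported in the abstract, which suggests the correction path can be very thin compared with the backbone.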

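Architecture: HGF is a dual-stream architecture consisting of a 1.58-bit ternary backbone and a low-rank FP16 correction path. The ternary backbone handles most of the computation and keeps memory use low. The FP16 correction path learns a correction to the backbone's output to improve quality. The adaptive gating mechanism adjusts the activation strength of the correction path dynamically based on the input.

Key innovations: HGF combines low-precision quantization with low-rank correction and adds adaptive gating on top. This hybrid approach substantially improves model quality while preserving memory efficiency. The paper also observes that quantization acts as structural regularization, meaning that low-precision quantization helps stabilize training.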

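Key design choices: The FP16 correction path uses a low-rank factorization to reduce parameter count and computation. The adaptive gate uses a sigmoid function to control the activation strength of the correction path, taking the ternary backbone's output as its input. The loss function consists of a language modeling loss plus a regularization term that constrains the parameters of the correction path.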
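The design described above can be sketched as a single forward pass in NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: the ternarization follows the absmean scheme of BitNet b1.58, and the gate parameterization (`w_gate`, one scalar gate per example), the function names, and the tensor shapes are all hypothetical. Training details such as the straight-through estimator and the regularization term are omitted.

```python
import numpy as np

def ternarize_absmean(w, eps=1e-8):
    """Map weights to {-1, 0, +1} with one absmean scale (BitNet b1.58 style)."""
    scale = np.mean(np.abs(w)) + eps
    w_t = np.clip(np.rint(w / scale), -1.0, 1.0)
    return w_t, scale

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hgf_linear(x, w, a, b, w_gate):
    """Ternary backbone plus gated low-rank FP16 correction (illustrative).

    x: (batch, d_in), w: (d_in, d_out), a: (d_in, r), b: (r, d_out),
    w_gate: (d_out, 1), a hypothetical gate projection.
    """
    w_t, scale = ternarize_absmean(w)
    backbone = (x @ w_t) * scale                     # 1.58-bit stream
    a16, b16 = a.astype(np.float16), b.astype(np.float16)  # factors stored in FP16
    correction = (x @ a16.astype(np.float32)) @ b16.astype(np.float32)
    gate = sigmoid(backbone @ w_gate)                # gate reads the backbone output
    return backbone + gate * correction

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 2
x = rng.normal(size=(4, d_in)).astype(np.float32)
w = rng.normal(size=(d_in, d_out)).astype(np.float32)
a = (0.1 * rng.normal(size=(d_in, rank))).astype(np.float32)
b = (0.1 * rng.normal(size=(rank, d_out))).astype(np.float32)
w_gate = rng.normal(size=(d_out, 1)).astype(np.float32)

y = hgf_linear(x, w, a, b, w_gate)
print(y.shape)
```

Computing the correction as two thin matrix products keeps its cost proportional to the rank, and because the sigmoid output lies between 0 and 1, the gate can only scale the correction down, so the ternary stream stays the dominant path.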

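📊 Experimental Highlights

On the TinyStories dataset, HGF 5.4 reaches a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This comes with only about 12-15% memory overhead beyond the ternary backbone. HGF also trains stably: while the full-precision differential attention baseline (Diff_Only) became unstable, with validation loss exceeding 1.68, HGF continued to converge throughout training.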
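The "approximately 55%" figure follows directly from the three reported validation losses, as the short check below shows. The perplexity comparison at the end rests on an assumption not stated in this summary, namely that the losses are mean per-token cross-entropy in nats, so that perplexity is the exponential of the loss.

```python
import math

FP16_LOSS = 0.8490    # FP16 baseline
BITNET_LOSS = 1.0294  # pure ternary (BitNet)
HGF_LOSS = 0.9306     # HGF 5.4

def gap_recovered(baseline: float, quantized: float, hybrid: float) -> float:
    """Fraction of the quantized-vs-baseline loss gap closed by the hybrid model."""
    return (quantized - hybrid) / (quantized - baseline)

recovered = gap_recovered(FP16_LOSS, BITNET_LOSS, HGF_LOSS)
print(f"gap recovered: {recovered:.1%}")

# Assumption: loss is cross-entropy in nats, so perplexity = exp(loss).
bitnet_ppl_increase = math.exp(BITNET_LOSS - FP16_LOSS) - 1
hgf_ppl_increase = math.exp(HGF_LOSS - FP16_LOSS) - 1
print(f"BitNet perplexity increase vs FP16: {bitnet_ppl_increase:.1%}")
print(f"HGF perplexity increase vs FP16:    {hgf_ppl_increase:.1%}")
```

Under that assumption, the BitNet result corresponds to a perplexity increase of about 20% over FP16, which is consistent with the 20-25% degradation the abstract cites for 1.58-bit quantization.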

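🎯 Applications

The HGF architecture can be used to deploy large language models on edge devices such as smartphones and IoT devices. By reducing the memory footprint while recovering model quality, HGF lets these devices run more capable models for applications such as on-device speech recognition, machine translation, and intelligent assistants. The work also contributes to low-precision quantization research and offers new ideas for designing more efficient models.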

📄 Abstract (original)

The deployment of Large Language Models (LLMs) on edge devices is fundamentally constrained by the "Memory Wall" -- a hardware limitation where memory bandwidth, not compute, becomes the bottleneck. Recent 1.58-bit quantization techniques (e.g., BitNet b1.58) dramatically reduce memory footprint but typically incur a perplexity degradation of 20-25% compared to FP16 baselines. In this work, we introduce Hybrid Gated Flow (HGF), a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path controlled by adaptive gates. Through extensive experiments on the TinyStories dataset across two training regimes (2500 and 3500 steps), we demonstrate that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline (0.8490). This recovery is achieved with only ~12-15% memory overhead beyond the ternary backbone. Furthermore, we provide empirical evidence for an emergent phenomenon: quantization as structural regularization. While a full-precision differential attention baseline (Diff_Only) exhibited training instability with validation loss exceeding 1.68, the ternary-anchored HGF maintained robust convergence throughout training. Finally, we report preliminary results extending this architecture to 1.2B and 3B parameter models trained on SlimPajama and FineWeb-Edu. These larger-scale experiments confirm that the architectural stability and quality recovery observed in small-scale proxies scale linearly to production-grade language modeling regimes.
