A Simple Linear Patch Revives Layer-Pruned Large Language Models

Authors: Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan

Category: cs.CL

Published: 2025-05-30 (updated: 2025-10-25)

Notes: 26 pages, accepted to NeurIPS 2025

🔗 Code/Project: https://github.com/chenxinrui-tsinghua/LinearPatch

💡 One-Sentence Takeaway

Proposes LinearPatch to counter the performance degradation of layer-pruned models.

🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

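Keywords: layer pruning, activation magnitude, large language models, model compression, Hadamard transformation, channel-wise scaling, performance retention, offline distillation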

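📋 Key Points

Existing layer pruning methods often cause substantial performance degradation when compressing large language models, mainly because activation magnitudes are mismatched at the pruning interface.
The paper proposes LinearPatch, which fuses a Hadamard transformation and a channel-wise scaling into a single matrix multiplication to correct the resulting shift in the activation distribution.
On LLaMA-3-8B, LinearPatch retains up to 94.15% of the original model's performance when 5 of 32 layers are pruned, and offline distillation raises the retention to 95.16%.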

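📝 Abstract (Translated)

Layer pruning has become a widely used technique for compressing large language models (LLMs). However, existing layer pruning methods often cause substantial performance degradation. The authors find that this degradation stems mainly from an overlooked issue: the mismatch of activation magnitudes at the pruning interface. Activations before and after the interface differ significantly in scale, which produces a distribution shift as they propagate through the remaining layers. To address this, the authors propose LinearPatch, a lightweight plug-and-play technique that fuses two operations into a single matrix multiplication: a Hadamard transformation and a channel-wise scaling. On LLaMA-3-8B, LinearPatch retains up to 94.15% of the original model's performance when pruning 5 of 32 layers, outperforming the previous state of the art by 4%. The patch can be further refined through memory-efficient offline distillation, which raises the retention to 95.16% within 30 minutes. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.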

🔬 Method Details

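Problem definition: The paper targets the mismatch of activation magnitudes that layer pruning creates at the pruning interface. Activations entering the interface have a markedly different scale from the activations the remaining layers expect, and existing methods do not correct this difference, so the shift propagates through the rest of the network and degrades performance.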
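
The mismatch described above can be made concrete with a small measurement. The following is a minimal sketch, not the authors' code: it assumes activations are available as `(tokens, channels)` arrays, and the helper names `channel_magnitude` and `scale_mismatch` are hypothetical. The synthetic `gain` vector merely stands in for the effect of the removed layers.

```python
import numpy as np

def channel_magnitude(x):
    """Mean absolute activation per channel; x has shape (tokens, channels)."""
    return np.abs(x).mean(axis=0)

def scale_mismatch(x_pre, x_post, eps=1e-8):
    """Per-channel ratio of post-interface to pre-interface magnitude.

    A ratio far from 1 means the remaining layers would receive inputs
    at a scale they were not trained on.
    """
    return channel_magnitude(x_post) / (channel_magnitude(x_pre) + eps)

# Synthetic illustration (assumption): the pruned layers would have
# amplified each channel by a different factor.
rng = np.random.default_rng(0)
x_pre = rng.normal(size=(128, 16))      # activations entering the pruned block
gain = np.linspace(1.0, 4.0, 16)        # hypothetical per-channel amplification
x_post = x_pre * gain                   # activations the next layer expects
ratio = scale_mismatch(x_pre, x_post)
```

Here `ratio` recovers the per-channel gain, which is the quantity a channel-wise scaling would need to compensate for.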

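Core idea: LinearPatch inserts a single linear map at the pruning interface that combines a Hadamard transformation with a channel-wise scaling, reducing the distribution shift of the activations and thereby preserving model performance. Because the two operations are fused into one matrix, the correction costs only one extra matrix multiplication.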

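Technical framework: LinearPatch has two components. The Hadamard transformation suppresses the massive outliers that appear at particular tokens, and the channel-wise scaling aligns activation statistics across the interface. The two are fused into a single matrix multiplication applied at the pruning interface.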
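
One way to picture the fusion is sketched below. This is an assumption-laden illustration rather than the paper's formulation, which this summary does not spell out: it assumes the patch has the form H · diag(s) · Hᵀ with an orthonormal Hadamard matrix H, and that the scale s is estimated from mean absolute activations in the rotated space. The function names `hadamard` and `build_patch` are hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def build_patch(x_pre, x_post, eps=1e-8):
    """Assumed composition: rotate, scale each rotated channel, rotate back.

    x_pre / x_post: (tokens, channels) activations before / after the
    pruned block, collected on calibration data.
    """
    h = hadamard(x_pre.shape[1])
    pre_mag = np.abs(x_pre @ h).mean(axis=0)
    post_mag = np.abs(x_post @ h).mean(axis=0)
    scale = post_mag / (pre_mag + eps)
    # Fold the three steps into one (channels x channels) matrix.
    return h @ np.diag(scale) @ h.T

# Synthetic check (assumption): the removed block acted as a scaling
# in the Hadamard-rotated space.
rng = np.random.default_rng(0)
d = 8
x_pre = rng.normal(size=(256, d))
true_gain = np.linspace(0.5, 2.0, d)
h = hadamard(d)
x_post = x_pre @ (h @ np.diag(true_gain) @ h.T)
patch = build_patch(x_pre, x_post)
x_patched = x_pre @ patch               # one matrix multiply at the interface
```

At inference time only `x @ patch` is computed, so the rotation and scaling add no cost beyond a single matrix multiplication.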

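Key innovation: The main contribution is fusing the two operations into one matrix multiplication, which keeps the added compute small while directly correcting the activation magnitude mismatch. Previous layer pruning approaches overlooked this mismatch at the pruning interface.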

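Key design: The Hadamard transformation is used to suppress the massive outliers that occur at particular tokens, and the channel-wise scaling adjusts the activation statistics. The patch can optionally be refined through memory-efficient offline distillation on 5K unlabeled samples, with the goal of keeping the pruned model's activation distribution consistent with that of the original model. As a plug-and-play component, LinearPatch integrates into existing model architectures.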
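
The refinement step can be illustrated with a toy version. This summary does not state the paper's distillation objective, so the sketch below is an assumption: it treats "offline" as caching the teacher's activations once and then updating only the patch matrix by gradient descent on a mean squared error. The names `refine_patch` and `mse` are hypothetical.

```python
import numpy as np

def mse(patch, x_pre, target):
    """Mean over tokens of the squared error between x_pre @ patch and target."""
    return float(np.mean(np.sum((x_pre @ patch - target) ** 2, axis=1)))

def refine_patch(patch, x_pre, target, lr=0.1, steps=200):
    """Gradient descent on the MSE above; only the patch matrix is updated.

    target holds cached teacher activations (assumption), so the original
    model does not need to be kept in memory during refinement.
    """
    p = patch.copy()
    n = x_pre.shape[0]
    for _ in range(steps):
        residual = x_pre @ p - target        # (tokens, channels)
        grad = 2.0 * x_pre.T @ residual / n  # gradient of the MSE w.r.t. p
        p -= lr * grad
    return p

# Synthetic illustration: start from the identity and recover a hidden linear map.
rng = np.random.default_rng(0)
x_pre = rng.normal(size=(256, 8))
hidden_map = rng.normal(size=(8, 8))         # stand-in for the teacher's behaviour
target = x_pre @ hidden_map                  # cached once, reused every step
init = np.eye(8)
refined = refine_patch(init, x_pre, target)
loss_before = mse(init, x_pre, target)
loss_after = mse(refined, x_pre, target)
```

In this toy setting the refined patch converges to the hidden map; in practice the initial patch would come from the calibration step rather than the identity.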

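📊 Experimental Highlights

On LLaMA-3-8B, LinearPatch retains up to 94.15% of the original model's performance when pruning 5 of 32 layers, outperforming the previous state of the art by 4%. Offline distillation raises the retention to 95.16% within 30 minutes on a single GPU.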

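🎯 Application Scenarios

Potential applications include compressing and optimizing large language models, particularly for resource-constrained settings such as mobile devices and edge computing. By recovering much of the accuracy lost to layer pruning, LinearPatch could make pruned models more practical where latency and resource consumption matter.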

📄 Abstract (Original)

Layer pruning has emerged as a widely used technique for compressing large language models (LLMs). However, existing layer pruning approaches often incur substantial performance degradation. We identify the majority of this degradation to a single yet previously overlooked issue: the mismatch of activation magnitudes at the pruning interface. The pre-interface activations exhibit significantly different scales from the post-interface ones, causing the distributional shift as it propagates through the remaining layers. To address this issue, we introduce LinearPatch, a lightweight and plug-and-play technique that fuses two operations into one matrix multiply at the pruning interface: (i) a Hadamard transformation that suppresses massive outliers at particular tokens and (ii) a channel-wise scaling that aligns activation statistics. On LLaMA-3-8B, LinearPatch preserves up to 94.15% of the original model's performance when pruning 5 out of 32 layers, outperforming the previous state of the art by 4%. The patch can be further refined with 5K unlabeled samples via memory-efficient offline distillation, pushing the retention to 95.16% within only 30 minutes on a single GPU. Code is available at https://github.com/chenxinrui-tsinghua/LinearPatch.
