Towards Efficient Automatic Self-Pruning of Large Language Models
Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
Category: cs.LG
Published: 2025-02-20
💡 One-Sentence Takeaway
Proposes Self-Pruner, a framework in which LLMs autonomously run an evolutionary search to prune themselves efficiently.
🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: Large Language Models, Self-Pruning, Evolutionary Algorithms, Post-Training Pruning, Performance Optimization, Natural Language Processing, Computational Efficiency
📋 Key Points
- Existing post-training structured pruning methods tend to cause significant performance degradation when pruning LLMs and lack an effective mechanism for determining pruning rates.
- The proposed Self-Pruner framework has the LLM autonomously execute an evolutionary search that automatically determines per-layer pruning rates, reducing human intervention.
- Experiments show that Self-Pruner prunes LLaMA-2-70B to the 49B level with only a 0.80% accuracy drop and a 1.39× speedup, clearly outperforming existing methods.
🔬 Method Details
Problem definition: The paper targets the performance degradation that LLMs suffer under post-training structured pruning; existing methods lack an effective mechanism for determining per-layer pruning rates, leading to suboptimal results.
Core idea: Self-Pruner exploits the LLM's own generative ability to autonomously execute an evolutionary search that finds the best pruning rate for each layer, improving pruning quality while reducing human intervention.
Technical framework: Self-Pruner's architecture comprises population generation, parent selection, and crossover and mutation modules, forming a complete evolutionary search pipeline in which the LLM both generates and evaluates candidate solutions.
Key innovation: The core novelty is letting the LLM itself drive the evolutionary search, which sharply reduces manual effort and improves the precision of the resulting pruning rates, in contrast to the hand-tuned settings of traditional methods.
Key design: Self-Pruner sets appropriate evolutionary-algorithm hyperparameters to optimize the population-generation and selection strategies while ensuring effective crossover and mutation; the specific parameter settings and loss-function design were validated through repeated experiments.
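The evolutionary loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: in the paper, an LLM is prompted to perform generation, selection, crossover, and mutation, and fitness is the pruned model's accuracy or perplexity; here `llm_propose` and `fitness` are hypothetical deterministic stand-ins so the loop structure is runnable.

```python
import random

def llm_propose(num_layers, target_rate, rng):
    # Stand-in for prompting the LLM to propose a layer-wise
    # pruning-rate vector near the target overall rate (hypothetical).
    return [min(0.9, max(0.0, rng.gauss(target_rate, 0.1)))
            for _ in range(num_layers)]

def fitness(rates, target_rate):
    # Proxy objective: penalize deviation from the target overall
    # pruning rate and uneven rates. The real fitness would be the
    # pruned model's accuracy/perplexity on a validation set.
    mean = sum(rates) / len(rates)
    var = sum((r - mean) ** 2 for r in rates) / len(rates)
    return -abs(mean - target_rate) - var

def crossover(a, b, rng):
    # Uniform crossover: pick each layer's rate from either parent.
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(rates, rng, sigma=0.05):
    # Gaussian perturbation, clamped to a valid pruning-rate range.
    return [min(0.9, max(0.0, r + rng.gauss(0, sigma))) for r in rates]

def self_pruner_search(num_layers=8, target_rate=0.5, pop_size=20,
                       generations=30, seed=0):
    rng = random.Random(seed)
    pop = [llm_propose(num_layers, target_rate, rng)
           for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half as parents, refill with offspring.
        pop.sort(key=lambda r: fitness(r, target_rate), reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pop = parents + children
    return max(pop, key=lambda r: fitness(r, target_rate))

best = self_pruner_search()
print(best)
```

In Self-Pruner the LLM replaces the hand-written `llm_propose`, `crossover`, and `mutate` operators, which is what removes the manual tuning that traditional evolutionary pruning requires.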
📊 Experimental Highlights
Pruning LLaMA-2-70B to the 49B level costs only a 0.80% accuracy drop while delivering a 1.39× speedup on an NVIDIA A100 80GB GPU. Pruning further to the 35B level costs only a 3.80% accuracy drop for a 1.70× speedup, demonstrating strong performance.
🎯 Application Scenarios
Potential applications include natural language processing, conversational systems, and large-scale text generation. Efficient self-pruning lets LLMs retain performance while substantially cutting compute, enabling deployment in resource-constrained environments and offering clear practical value.
📄 Abstract (Original)
Despite exceptional capabilities, Large Language Models (LLMs) still face deployment challenges due to their enormous size. Post-training structured pruning is a promising solution that prunes LLMs without the need for retraining, reducing computational overhead while remaining hardware-deployment friendly. However, the training-free nature of post-training structured pruning leads to significant performance degradation. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. Meanwhile, we find that LLMs may have prior knowledge about their own redundancy. Based on this insight, we introduce $\textbf{Self-Pruner}$, an end-to-end automatic self-pruning framework for LLMs, which efficiently searches layer-wise pruning rates. Specifically, $\textbf{Self-Pruner}$ leverages LLMs to autonomously execute the entire evolutionary search process to search for pruning rate configurations. In this process, LLMs are used to generate populations, select parent solutions from the current population, and perform crossover and mutation operations to produce offspring solutions. In this way, LLMs automatically generate and evaluate a large number of candidate solutions, effectively converging to find the pruning rate configurations with minimal human intervention. Extensive experiments demonstrate $\textbf{Self-Pruner}$'s better performance compared to existing state-of-the-art methods. Notably, $\textbf{Self-Pruner}$ prunes LLaMA-2-70B to 49B level with only 0.80$\%$ drop in accuracy across seven commonsense reasoning tasks, achieving a 1.39$\times$ speedup on NVIDIA A100 80GB GPU. Further pruning to 35B level resulted in only a 3.80$\%$ decrease in accuracy while obtaining a 1.70$\times$ speedup.