Dynamic Low-Rank Sparse Adaptation for Large Language Models

Authors: Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Yang Liu, Jing Lin, Yiwu Yao, Rongrong Ji

Category: cs.LG

Published: 2025-02-20

Note: Accepted to ICLR 2025

🔗 Code/Project: https://github.com/wzhuang-xmu/LoSA

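💡 One-Sentence Summary

Proposes dynamic Low-rank Sparse Adaptation (LoSA), a method that improves the performance of sparse large language models without increasing inference latency.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, sparsification, low-rank adaptation, model fine-tuning, representation mutual information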

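📋 Key Points

Existing methods for sparsifying large language models cause significant performance degradation. Fine-tuning with LoRA directly does not recover performance effectively, and the LoRA weights are hard to merge into the sparse model.
The core idea of LoSA is to dynamically sparsify the LoRA outcomes according to the sparse weights during fine-tuning, and to use representation mutual information to guide the choice of layer-wise sparsity rates.
Experiments show that LoSA substantially improves sparse LLaMA-2-7B: it lowers perplexity, raises zero-shot accuracy, and delivers inference speedups.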

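📝 Abstract (translated from Chinese)

This paper proposes dynamic Low-rank Sparse Adaptation (LoSA), a novel method that integrates low-rank adaptation into LLM sparsity within a unified framework, improving the performance of sparse LLMs without increasing inference latency. During fine-tuning, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights, which guarantees that the LoRA module can be merged into the sparse LLM after training. LoSA also uses Representation Mutual Information (RMI) as an indicator of layer importance, so that layer-wise sparsity rates can be determined efficiently during fine-tuning. Building on this, LoSA adjusts the rank of each LoRA module according to the variation in layer-wise reconstruction error, giving each layer an appropriate amount of fine-tuning in order to reduce the output discrepancy between the dense and sparse LLMs. Extensive experiments show that LoSA improves the efficacy of sparse LLMs within a few hours and adds no inference overhead. For example, LoSA reduces the perplexity of sparse LLaMA-2-7B by 68.73 and raises zero-shot accuracy by 16.32%, while achieving a 2.60x speedup on CPU and a 2.23x speedup on GPU, with only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU.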

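🔬 Method Details

Problem definition: The paper addresses the significant performance drop that follows the sparsification of large language models. The obvious remedy, fine-tuning the sparse model with LoRA, has two main shortcomings: first, the LoRA weights cannot be merged directly into the sparse model; second, performance recovery is insufficient at high sparsity ratios.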

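Core idea: During fine-tuning, LoSA dynamically aligns the LoRA output with the sparsity pattern of the original sparse model, so that what LoRA learns can be merged into the sparse model without changing that pattern. It also uses Representation Mutual Information (RMI) to guide the allocation of layer-wise sparsity rates, and adapts the LoRA rank of each layer according to its reconstruction error, which makes fine-tuning more effective.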
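The alignment described above can be illustrated with a small NumPy sketch. This is only a toy example under stated assumptions, not the authors' implementation: the shapes, the random 50% mask, and the names `W`, `mask`, `A`, `B`, and `masked_lora_forward` are all hypothetical. It shows why masking the low-rank update with the same binary mask as the pruned weight keeps the merged weight sparse, whereas merging an unmasked update would fill in the pruned positions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 16, 2  # toy sizes, chosen only for illustration

# Dense weight and a binary pruning mask (1 = kept, 0 = pruned).
W = rng.normal(size=(d_out, d_in))
mask = (rng.random((d_out, d_in)) > 0.5).astype(W.dtype)
W_sparse = W * mask

# Hypothetical LoRA factors; the low-rank update is B @ A.
B = rng.normal(size=(d_out, rank))
A = rng.normal(size=(rank, d_in))


def masked_lora_forward(x, W_sparse, A, B, mask):
    """Forward pass in which the low-rank update is masked like the weight."""
    delta = (B @ A) * mask  # zero the update wherever the weight was pruned
    return x @ (W_sparse + delta).T


# Merging after fine-tuning: the masked update keeps pruned entries at zero.
W_merged = W_sparse + (B @ A) * mask
# Merging an unmasked update would make the weight dense again.
W_naive = W_sparse + B @ A

x = rng.normal(size=(4, d_in))
y_finetune = masked_lora_forward(x, W_sparse, A, B, mask)
y_merged = x @ W_merged.T  # single matmul, no extra LoRA branch at inference
```

Because the masked update and the sparse weight share one mask, the merged layer produces the same output as the fine-tuning forward pass while needing only one sparse matrix at inference time.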

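Technical framework: The overall LoSA pipeline consists of the following main steps: 1) initialize the sparse large language model; 2) during fine-tuning, dynamically sparsify the output of each LoRA module so that it stays consistent with the structure of the original sparse model; 3) use Representation Mutual Information (RMI) to assess the importance of each layer and adjust the layer-wise sparsity rates accordingly; 4) adapt the rank of each LoRA module according to the layer-wise reconstruction error; 5) merge the fine-tuned LoRA weights into the sparse model.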

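Key innovations: 1) dynamically sparsifying the LoRA output, which ensures that the LoRA weights can be merged into the sparse model; 2) using Representation Mutual Information (RMI) to guide the allocation of layer-wise sparsity rates, so that the sparsity budget is spent where it costs the least; 3) adapting the rank of each LoRA module according to the layer-wise reconstruction error, which allows finer-grained fine-tuning. Compared with existing methods, LoSA strikes a better balance between sparsity and performance and recovers more performance at high sparsity.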
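To make the idea of importance-guided sparsity allocation concrete, here is a minimal sketch. It assumes that per-layer importance scores are already available (they stand in for RMI-based scores; how RMI is computed is not covered here) and uses a simple linear rule that is purely an assumption for illustration: rates are shifted around a global target so that more important layers are pruned less while the mean sparsity stays at the target. The function name `allocate_sparsity` and the `spread` parameter are hypothetical.

```python
import numpy as np


def allocate_sparsity(importance, target, spread=0.1):
    """Map per-layer importance scores to per-layer sparsity rates.

    Hypothetical rule: rates deviate from `target` by at most `spread`,
    higher importance gives lower sparsity, and the mean rate is `target`.
    """
    if not 0.0 <= spread <= min(target, 1.0 - target):
        raise ValueError("spread would push rates outside [0, 1]")
    imp = np.asarray(importance, dtype=float)
    centered = imp - imp.mean()
    scale = np.abs(centered).max()
    if scale == 0:
        # All layers equally important: fall back to uniform sparsity.
        return np.full_like(imp, target)
    return target - spread * centered / scale


# Example: four layers, 70% overall sparsity.
scores = [0.9, 0.5, 0.2, 0.4]  # made-up importance scores
rates = allocate_sparsity(scores, target=0.7, spread=0.1)
```

Centering the scores before scaling is what keeps the average rate equal to the global target; any other monotone mapping with the same property would serve the illustration equally well.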

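Key design: 1) Dynamic sparsification strategy: using the weight mask of the original sparse model, the entries of the LoRA output that correspond to pruned positions are set to zero; 2) Representation Mutual Information (RMI) computation: RMI measures how much information is passed between the input and output of each layer; the larger the RMI, the more important the layer, and the lower the sparsity rate it should be assigned; 3) Adaptive rank adjustment: the rank of each LoRA module is adjusted dynamically according to the layer's reconstruction error; the larger the error, the more parameters the layer needs for fine-tuning, and the higher the rank it should be assigned.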
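The adaptive rank adjustment can be sketched as follows. The text only states that layers with larger reconstruction error should receive a higher rank; the proportional split of a fixed total rank budget used here is an assumption made for illustration, as are the names `reconstruction_error`, `allocate_ranks`, `total_rank`, and `min_rank`.

```python
import numpy as np


def reconstruction_error(x, W_dense, W_sparse):
    """Squared Frobenius norm of the dense-vs-sparse output gap for one layer."""
    diff = x @ W_dense.T - x @ W_sparse.T
    return float(np.sum(diff ** 2))


def allocate_ranks(errors, total_rank, min_rank=1):
    """Split a rank budget across layers in proportion to their errors.

    Hypothetical rule: every layer gets `min_rank`, and the remaining budget
    is shared proportionally to the reconstruction errors.
    """
    err = np.asarray(errors, dtype=float)
    n = err.size
    if total_rank < min_rank * n:
        raise ValueError("total_rank is too small for min_rank per layer")
    spare = total_rank - min_rank * n
    share = err / err.sum() if err.sum() > 0 else np.full(n, 1.0 / n)
    extra = np.floor(share * spare).astype(int)
    # Hand out the ranks lost to rounding, largest-error layers first.
    leftover = int(spare - extra.sum())
    order = np.argsort(-err, kind="stable")
    extra[order[:leftover]] += 1
    return min_rank + extra


# Example with made-up errors for three layers and a budget of 16 ranks.
ranks = allocate_ranks([4.0, 1.0, 3.0], total_rank=16)
```

In a real fine-tuning loop the errors would be recomputed with `reconstruction_error` on calibration inputs as the sparsity rates change, and the ranks updated accordingly; the sketch covers only a single allocation step.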

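📊 Experimental Highlights

LoSA delivers a clear improvement on sparse LLaMA-2-7B. It reduces the model's perplexity by 68.73 and raises zero-shot accuracy by 16.32%. It also achieves a 2.60x speedup on CPU and a 2.23x speedup on GPU, and requires only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. These results indicate that LoSA improves the performance of sparse LLMs at a modest fine-tuning cost.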

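🎯 Application Scenarios

LoSA applies to settings where large language models must be deployed on resource-constrained hardware, such as mobile and edge devices. It improves the performance of sparse LLMs without adding inference latency, which makes model deployment more efficient and helps bring LLMs into practical use.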

📄 Abstract (Original)

Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it endures significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune the sparse LLMs offers an intuitive approach to counter this predicament, while it holds shortcomings include: 1) The inability to integrate LoRA weights into sparse LLMs post-training, and 2) Insufficient performance recovery at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing the inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. Besides, LoSA leverages Representation Mutual Information (RMI) as an indicator to determine the importance of layers, thereby efficiently determining the layer-wise sparsity rates during fine-tuning. Predicated on this, LoSA adjusts the rank of the LoRA module based on the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning for each layer to reduce the output discrepancies between dense and sparse LLMs. Extensive experiments tell that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by 68.73 and increased zero-shot accuracy by 16.32$\%$, achieving a 2.60$\times$ speedup on CPU and 2.23$\times$ speedup on GPU, requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
