ALLoRA: Adaptive Learning Rate Mitigates LoRA Fatal Flaws

📄 arXiv: 2410.09692v1

Authors: Hai Huang, Randall Balestriero

Categories: cs.LG, cs.AI

Published: 2024-10-13


💡 One-Sentence Takeaway

ALLoRA is proposed to overcome LoRA's limitations in short finetuning runs

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: low-rank adaptation, adaptive learning rate, short training episodes, large language models, model finetuning, Dropout, hyperparameter optimization

📋 Key Points

  1. In short training episodes, LoRA suffers from three problems: Dropout is unsuitable, its initialization slows the training dynamics, and it induces short-sighted interactions between layers.
  2. ALLoRA removes Dropout and the scaling factor and instead applies an adaptive learning rate, resolving these limitations.
  3. Experiments show that ALLoRA achieves better accuracy than LoRA and its variants across a variety of settings, validating its effectiveness.

📝 Abstract (Summary)

Low-Rank Adaptation (LoRA) is a key technique for finetuning Large Language Models (LLMs). This paper identifies three core limitations of LoRA under limited data and training steps: Dropout is unsuitable for short training episodes, the initialization induces slow training dynamics, and there are short-sighted interactions between layers. To address these, the authors propose a Dropout-free, scaling-free LoRA with an adaptive learning rate (ALLoRA), which scales gradients by a coefficient inversely proportional to each parameter's $\ell_2$ norm, alleviating all three problems. Experiments show that ALLoRA achieves better accuracy than LoRA and its variants, such as DoRA, across various settings.

🔬 Method Details

Problem definition: The paper targets three concrete problems of LoRA in short training episodes: Dropout is ill-suited to short runs, the initialization slows the training dynamics, and interactions between the layers' LoRA modules are short-sighted. Together these limit convergence and final performance.
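To make the ingredients concrete, here is a minimal NumPy sketch of a vanilla LoRA layer showing exactly the pieces the paper criticizes: the Dropout, the scaling factor, and the zero initialization of $B$. The names and sizes are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                         # hidden size and low rank (illustrative)
W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01  # A: small random init
B = np.zeros((r, d))                # B initialized at 0, so AB = 0 at start
alpha = 16.0                        # the scaling factor ALLoRA removes
p_drop = 0.1                        # the Dropout rate ALLoRA removes

def lora_forward(x, train=True):
    # Inverted Dropout on the adapter input, as in common LoRA implementations
    h = x * (rng.random(x.shape) > p_drop) / (1 - p_drop) if train else x
    # Adapted output: x @ W + (alpha / r) * dropout(x) @ A @ B
    return x @ W + (alpha / r) * (h @ A) @ B

x = rng.normal(size=(1, d))
# With B = 0 the adapter contributes nothing: the output equals x @ W
assert np.allclose(lora_forward(x), x @ W)
```

Because $B$ starts at $0$, the adapter's output is zero at initialization and must slowly "escape" from it, the dynamic the paper argues Dropout further slows in short runs.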

Core idea: ALLoRA removes Dropout and the scaling factor and couples LoRA with an adaptive learning rate to improve the training dynamics, aiming for faster convergence and better results in short finetuning runs.

Technical framework: ALLoRA's adaptive learning rate mechanism scales the per-sample, per-parameter gradients by a coefficient inversely proportional to the parameter's $\ell_2$ norm; dropping Dropout and the scaling factor also simplifies the hyperparameter setup.

Key innovation: The main novelty is the adaptive learning rate design, which dynamically rescales gradients to address all three of LoRA's limitations and markedly improves short-run finetuning over standard LoRA.

Key design: ALLoRA eliminates two hyperparameters, the scaling factor and the Dropout rate, and adopts a sample- and parameter-wise gradient scaling strategy that remains effective and stable in short training episodes.
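The gradient scaling above can be sketched as follows. The summary only states that the coefficient is inversely proportional to the parameter's $\ell_2$ norm, so the exact functional form and the `eps` stabilizer here are illustrative assumptions, not the paper's formula:

```python
import numpy as np

def allora_scale(grad, param, eps=1e-6):
    """Rescale a gradient by a coefficient inversely proportional to the
    parameter's l2 norm. The 1/(eps + ||param||) form and `eps` are
    illustrative assumptions, not the paper's exact formula."""
    return grad / (eps + np.linalg.norm(param))

# Near the B = 0 initialization the norm is tiny, so the coefficient is
# large, speeding the escape from 0; for large weights the step is damped.
g = np.ones(4)
step_near_init = allora_scale(g, np.zeros(4))
step_large_weight = allora_scale(g, np.full(4, 10.0))
assert step_near_init[0] > step_large_weight[0]
```

This norm-dependent coefficient is what lets ALLoRA drop the fixed scaling factor: the effective step size adapts to each parameter's magnitude instead of being set by a global hyperparameter.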

🖼️ Key Figures

fig_0
fig_1
fig_2

📊 Experimental Highlights

ALLoRA outperforms standard LoRA in accuracy across a variety of settings, with notably faster convergence and higher final performance in short training regimes. It also shows clear accuracy gains over recent LoRA variants such as DoRA.

🎯 Application Scenarios

Potential application areas include natural language processing, machine translation, and dialogue systems. By improving finetuning under data scarcity, ALLoRA has clear practical value; it may also spur further research on efficient finetuning methods, enhancing the adaptability and performance of large language models.

📄 Abstract (Original)

Low-Rank Adaptation (LoRA) is the bread and butter of Large Language Model (LLM) finetuning. LoRA learns an additive low-rank perturbation, $AB$, of a pretrained matrix parameter $W$ to align the model to a new task or dataset with $W+AB$. We identify three core limitations to LoRA for finetuning--a setting that employs limited amount of data and training steps. First, LoRA employs Dropout to prevent overfitting. We prove that Dropout is only suitable for long training episodes but fails to converge to a reliable regularizer for short training episodes. Second, LoRA's initialization of $B$ at $0$ creates a slow training dynamic between $A$ and $B$. That dynamic is also exacerbated by Dropout that further slows the escape from $0$ for $B$ which is particularly harmful for short training episodes. Third, the scaling factor multiplying each LoRA additive perturbation creates ``short-sighted'' interactions between the LoRA modules of different layers. Motivated by principled analysis of those limitations, we find an elegant solution: a Dropout-free, scaling-free, LoRA with Adaptive Learning rate--coined ALLoRA. By scaling the per sample and per parameter gradients with a coefficient inversely proportional to parameters' $\ell_2$ norm, ALLoRA alleviates those three limitations. As a by-product, ALLoRA removes two hyper-parameters from LoRA: the scaling factor and the dropout rate. Empirical results show that ALLoRA admits better accuracy than LoRA on various settings, including against recent LoRA variants such as Weight-Decomposed Low-Rank Adaptation (DoRA). Ablation studies show our solution is the optimal in a family of weight-dependent / output-dependent approaches on various LLMs including the latest Llama3.