Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
Authors: Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae
Categories: cs.LG, cs.AI, cs.DC, cs.ET, cs.PF
Published: 2024-10-26
💡 One-Line Takeaway
Proposes Deep Optimizer States, an interleaved offloading technique that addresses the memory bottleneck in training large transformer models
🎯 Matched Domain: Pillar 9: Embodied Foundation Models
Keywords: transformers, large language models, memory management, dynamic optimization, DeepSpeed, training acceleration, GPU computing, hybrid CPU-GPU computation
📋 Key Points
- Existing approaches hit a memory bottleneck when training large transformer models and manage the combined GPU and host memory suboptimally.
- This paper proposes a novel dynamic optimizer-state management technique that adjusts where the optimizer state resides at each iteration to improve memory utilization.
- Experiments show that, integrated with DeepSpeed, the method achieves 2.5× faster iterations, significantly outperforming state-of-the-art approaches.
📝 Abstract (Translated)
With the rapid adoption of transformers and large language models, parameter counts have reached hundreds of billions and keep growing. Under these circumstances, training transformers is extremely expensive and often hits a "memory wall": even with 3D parallelism (pipeline, tensor, data) and the aggregated memory of many GPUs, there is still not enough room for the necessary data structures. State-of-the-art approaches therefore offload the optimizer state, at least partially, to host memory and perform hybrid CPU-GPU computation, but the management of the combined memory is often suboptimal, leaving data movement poorly overlapped with computation. This paper proposes a new technique that dynamically moves parts of the optimizer state between host and GPU memory, significantly improving training efficiency and achieving 2.5× faster iterations when integrated with DeepSpeed.
🔬 Method Details
Problem definition: This paper targets the poor training efficiency caused by memory limits when training large transformer models. Existing approaches manage the combined GPU and host memory suboptimally, leaving little overlap between computation and data movement.
Core idea: The key insight is that the interleaving of the forward, backward, and update phases causes fluctuations in GPU memory utilization; this headroom can be exploited to dynamically move parts of the optimizer state between host and GPU memory at each iteration, improving both memory utilization and compute efficiency.
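The observation above can be made concrete with a back-of-the-envelope estimate: once the backward pass has consumed the activations, GPU memory usage dips below its forward/backward peak, and that dip is the window in which offloaded optimizer state can be staged on the GPU. The sketch below is illustrative only; the function name and the simplistic memory accounting are our own, not the paper's implementation.

```python
def update_phase_headroom(gpu_capacity: int,
                          param_bytes: int,
                          grad_bytes: int,
                          residual_activation_bytes: int,
                          safety_margin: int) -> int:
    """Estimate how many bytes of offloaded optimizer state can be
    staged on the GPU during the update phase.

    During forward/backward, activations push memory usage to its peak;
    by the update phase most activations have been freed, so usage dips.
    The dip (minus a safety margin for fragmentation and workspace) is
    the headroom exploited for interleaved offloading.
    """
    update_usage = param_bytes + grad_bytes + residual_activation_bytes
    return max(0, gpu_capacity - update_usage - safety_margin)
```

For example, on an 80 GB GPU holding 20 GB of parameters and 20 GB of gradients, with activations fully freed and a 2 GB margin, roughly 38 GB of optimizer state could be staged per iteration under this simplified model.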
Technical framework: The overall architecture splits the LLM into subgroups and, guided by the proposed performance model, schedules each subgroup's update phase on either the CPU or the GPU. The main modules are memory management, the performance model, and the scheduler.
Key innovation: The central technical contribution is dynamic optimizer-state management: by monitoring memory usage at runtime and adjusting placement accordingly, the method substantially raises the utilization of both GPU and CPU resources, in contrast to conventional static memory management.
Key design considerations: The design accounts for data-movement cost, the relative speedup of running updates on the GPU versus the CPU, and contention for shared resources, balancing the compute demands and memory usage of the different training phases.
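A minimal sketch of how such a performance model might drive per-subgroup scheduling is shown below. All names (`Subgroup`, `schedule_updates`) and the greedy placement policy are hypothetical simplifications of the paper's model: a subgroup's update runs on the GPU only if its optimizer state fits in the available headroom and the transfer-plus-GPU-update time beats the CPU update time.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Subgroup:
    """A slice of model parameters together with its optimizer state."""
    name: str
    state_bytes: int      # size of the subgroup's optimizer state
    gpu_update_s: float   # estimated update time on the GPU
    cpu_update_s: float   # estimated update time on the CPU

def schedule_updates(subgroups: List[Subgroup],
                     free_gpu_bytes: int,
                     h2d_bandwidth: float) -> Dict[str, str]:
    """Greedily decide, per subgroup, whether its update phase runs on
    the GPU (paying a host-to-device transfer) or stays on the CPU."""
    placement: Dict[str, str] = {}
    for sg in subgroups:
        transfer_s = sg.state_bytes / h2d_bandwidth
        gpu_total_s = transfer_s + sg.gpu_update_s
        if sg.state_bytes <= free_gpu_bytes and gpu_total_s < sg.cpu_update_s:
            placement[sg.name] = "gpu"
            free_gpu_bytes -= sg.state_bytes   # headroom is consumed
        else:
            placement[sg.name] = "cpu"
    return placement
```

A real scheduler would also model contention between transfers and CPU updates for shared memory bandwidth, which this sketch ignores.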
📊 Experimental Highlights
Integrated with DeepSpeed, the proposed method delivers 2.5× faster iterations than state-of-the-art approaches, a substantial improvement in training efficiency that offers a practical path to training large models at scale.
🎯 Application Scenarios
Potential application areas include natural language processing, computer vision, and other domains that require large-scale model training. By raising training efficiency, the approach shortens model iteration cycles and helps move these techniques toward practical deployment.
📄 Abstract (Original)
Transformers and large language models (LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a "memory wall", i.e., even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect bandwidth and computational capabilities of CPUs and GPUs. In this paper, we leverage a key observation that the interleaving of the forward, backward and update phases generate fluctuations in the GPU memory utilization, which can be exploited to dynamically move a part of the optimizer state between the host and the GPU memory at each iteration. To this end, we design and implement Deep Optimizer States, a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU based on our proposed performance model that addresses the trade-off between data movement cost, acceleration on the GPUs vs the CPUs, and competition for shared resources. We integrate our approach with DeepSpeed and demonstrate 2.5× faster iterations over state-of-the-art approaches using extensive experiments.