NoLoCo: No-all-reduce Low Communication Training Method for Large Models

📄 arXiv: 2506.10911v1

Authors: Jari Kolehmainen, Nikolay Blagoev, John Donaghy, Oğuzhan Ersoy, Christopher Nies

Category: cs.LG

Published: 2025-06-12


💡 One-Sentence Takeaway

Proposes NoLoCo to address the communication bottleneck in large-model training.

🎯 Matched domain: Pillar 9: Embodied Foundation Models

Keywords: large-model training, low-communication training, optimization algorithms, Nesterov momentum, implicit synchronization, model convergence, accelerators

📋 Key Points

  1. Existing large-model training methods rely on high-bandwidth clusters, which are expensive and scale poorly, limiting the size of models that can be trained.
  2. NoLoCo avoids explicitly synchronizing all model parameters; instead, each replica partially averages its weights with those of a randomly selected peer, achieving implicit synchronization and reducing communication overhead.
  3. Experiments across a range of accelerator counts and model sizes show that NoLoCo incurs significantly less communication overhead than existing methods and converges up to 4% faster than DiLoCo.

📝 Abstract (Translated)

Training large language models typically relies on compute clusters with high-bandwidth interconnects, but as these clusters scale up, cost and feasibility become increasingly problematic. Existing low-communication training methods still synchronize model parameters, which is costly over low-bandwidth networks. This paper proposes a novel optimization method, NoLoCo, which does not explicitly synchronize all model parameters during training and thus eliminates the need for collective communication. NoLoCo achieves implicit synchronization through a new variant of the Nesterov momentum optimizer that partially averages a worker's weights with those of a randomly selected other replica. The paper provides a theoretical convergence analysis of the proposed optimizer and validates it empirically on models of various sizes.

🔬 Method Details

Problem definition: This work addresses the communication bottleneck in large-scale model training caused by reliance on high-bandwidth clusters. Existing methods such as DiLoCo still require global synchronization of model parameters, which adds complexity and overhead to training.

Core idea: NoLoCo does not explicitly synchronize all model parameters. Instead, each worker partially averages its weights with those of a randomly selected other replica, achieving implicit synchronization and eliminating the need for collective communication.
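
A minimal sketch of this pairwise, implicit synchronization step is shown below, assuming a PyTorch setting with an initialized `torch.distributed` process group. The mixing coefficient `alpha`, the per-parameter exchange, and the send/recv ordering are illustrative choices, not details taken from the paper:

```python
import torch
import torch.distributed as dist

def partial_average_with_peer(model: torch.nn.Module, peer_rank: int, alpha: float = 0.5):
    """Blend local weights with the weights of one randomly chosen peer.

    Only point-to-point send/recv is used, so no collective (all-reduce)
    over all replicas is needed. `alpha` is a hypothetical mixing ratio.
    """
    rank = dist.get_rank()
    for param in model.parameters():
        remote = torch.empty_like(param.data)
        # Fixed ordering (lower rank sends first) avoids a send/recv deadlock.
        if rank < peer_rank:
            dist.send(param.data, dst=peer_rank)
            dist.recv(remote, src=peer_rank)
        else:
            dist.recv(remote, src=peer_rank)
            dist.send(param.data, dst=peer_rank)
        # Partial averaging: move a fraction `alpha` of the way toward the peer.
        param.data.mul_(1.0 - alpha).add_(remote, alpha=alpha)
```

In practice one would likely exchange a single flattened buffer rather than per-parameter tensors; the essential point is that each synchronization touches only one randomly selected peer rather than all replicas.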

Technical framework: NoLoCo's architecture centers on an optimizer module built on a variant of the Nesterov momentum optimizer, which performs partial averaging with randomly selected model weights. During training, model parameters are synchronized implicitly across accelerators, with no global synchronization step.
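
The abstract does not specify exactly how the Nesterov-style momentum interacts with the partial averaging, so the following self-contained sketch is only one plausible reading: the pull toward the peer's weights is filtered through a Nesterov-style momentum buffer before being applied. The names `alpha` and `beta` and the exact update form are assumptions for illustration:

```python
import torch

def nesterov_style_mix(local, remote, momentum, alpha=0.5, beta=0.9):
    """Apply a pairwise correction through a Nesterov-style momentum buffer.

    local, remote, momentum: lists of tensors with matching shapes.
    alpha: hypothetical mixing ratio toward the peer's weights.
    beta:  hypothetical momentum factor.
    `local` and `momentum` are updated in place.
    """
    with torch.no_grad():
        for p, r, m in zip(local, remote, momentum):
            delta = alpha * (r - p)      # pull toward the randomly chosen peer
            m.mul_(beta).add_(delta)     # momentum accumulation on the correction
            p.add_(delta + beta * m)     # Nesterov-style look-ahead application

# Toy usage: two divergent "replicas" drift toward each other without any all-reduce.
local = [torch.zeros(4)]
remote = [torch.ones(4)]
momentum = [torch.zeros(4)]
nesterov_style_mix(local, remote, momentum)
print(local[0])  # values between 0 and 1, pulled toward the peer
```

Between synchronizations, each worker runs ordinary local optimizer steps with no communication at all, which is why there is no global blocking step in the training loop.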

Key innovation: NoLoCo's main innovation is that it requires no global blocking communication, which significantly reduces communication overhead and accelerator idle time, while also converging faster (up to 4% relative to DiLoCo). This fundamentally distinguishes it from existing low-communication training methods.

Key design: The key hyperparameters in NoLoCo are the mixing ratio applied to the randomly selected peer's weights and the momentum factor. The loss function is the same as in conventional optimization; the efficiency gains come from the implicit synchronization mechanism.
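
As a hedged summary of where these two hyperparameters could enter the update (the notation is ours and may differ from the paper's formulation), worker i pairing with a random peer j at synchronization step t might perform:

```latex
\begin{aligned}
d_i^{(t)} &= \alpha \left(\theta_j^{(t)} - \theta_i^{(t)}\right)
  && \text{(partial pull toward peer $j$, mixing ratio $\alpha$)}\\
m_i^{(t)} &= \beta\, m_i^{(t-1)} + d_i^{(t)}
  && \text{(momentum on the correction, factor $\beta$)}\\
\theta_i^{(t+1)} &= \theta_i^{(t)} + d_i^{(t)} + \beta\, m_i^{(t)}
  && \text{(Nesterov-style look-ahead update)}
\end{aligned}
```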

📊 Experimental Highlights

Across models ranging from 125M to 6.8B parameters, NoLoCo incurs significantly less communication overhead than fully sharded data parallel training and converges up to 4% faster than DiLoCo. Its synchronization step is estimated to be an order of magnitude faster than DiLoCo's all-reduce when a few hundred accelerators train over the internet, and the absence of global blocking communication greatly reduces accelerator idle time.

🎯 Application Scenarios

NoLoCo has broad potential for large-scale language model training, particularly in resource-constrained settings. Its low communication overhead makes training large models over low-bandwidth networks feasible, helping broaden access to large-scale AI training.

📄 Abstract (Original)

Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators, communicating over a high-bandwidth interconnect. Scaling up these clusters is expensive and can become impractical, imposing limits on the size of models that can be trained. Several recent studies have proposed training methods that are less communication intensive, avoiding the need for a highly connected compute cluster. These state-of-the-art low communication training methods still employ a synchronization step for model parameters, which, when performed over all model replicas, can become costly on a low-bandwidth network. In this work, we propose a novel optimization method, NoLoCo, that does not explicitly synchronize all model parameters during training and, as a result, does not require any collective communication. NoLoCo implicitly synchronizes model weights via a novel variant of the Nesterov momentum optimizer by partially averaging model weights with a randomly selected other one. We provide both a theoretical convergence analysis for our proposed optimizer as well as empirical results from language model training. We benchmark NoLoCo on a wide range of accelerator counts and model sizes, between 125M to 6.8B parameters. Our method requires significantly less communication overhead than fully sharded data parallel training or even widely used low communication training method, DiLoCo. The synchronization step itself is estimated to be one magnitude faster than the all-reduce used in DiLoCo for few hundred accelerators training over the internet. We also do not have any global blocking communication that reduces accelerator idling time. Compared to DiLoCo, we also observe up to $4\%$ faster convergence rate with wide range of model sizes and accelerator counts.