ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models

Authors: Nandan Kumar Jha, Brandon Reagen

Categories: cs.LG, cs.AI

Published: 2024-10-12 (updated: 2024-11-16)

Comments: Accepted to NeurIPS 2024 Workshop on Attributing Model Behavior at Scale (Camera-ready version)

🔗 Code/Project: https://github.com/Nandan91/relu-revival-normfree

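💡 One-sentence takeaway

In LayerNorm-free large language models, ReLU outperforms GELU, lowering perplexity by 8.2%.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, activation functions, ReLU, GELU, LayerNorm, normalization-free, Transformer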

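📋 Key points

LayerNorm stabilizes training, but it complicates interpretability, signal propagation, and computational cost, which motivates LayerNorm-free alternatives.
The paper finds that in LayerNorm-free Transformers, ReLU outperforms the commonly used GELU, mainly because GELU causes entropic overload in the early layers.
Experiments show that switching to ReLU in a LayerNorm-free architecture reduces perplexity by 8.2%.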

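📝 Abstract (translated)

LayerNorm is a key component of modern large language models (LLMs), used to stabilize training and ensure smooth optimization. However, it poses significant challenges for mechanistic interpretability, outlier feature suppression, faithful signal propagation, and the computational and communication complexity of private inference. This paper investigates which activation functions are desirable in LayerNorm-free decoder-only LLMs. Contrary to the usual preference for GELU in Transformer-based models, the empirical results show the opposite trend: ReLU clearly outperforms GELU in LayerNorm-free models, yielding an 8.2% perplexity improvement. The authors identify a key problem with GELU: early layers experience entropic overload, so the representational capacity of the attention heads is under-utilized. This indicates that smoother activations such as GELU are ill-suited to LayerNorm-free architectures, whereas ReLU's geometric properties (specialization in input space and intra-class selectivity) improve learning dynamics and retain information better when LayerNorm is absent. The study offers key insights for optimizing Transformer architectures, especially where LayerNorm introduces significant challenges. Code and implementation are available at https://github.com/Nandan91/relu-revival-normfree.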

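🔬 Method details

Problem definition: Current large language models rely on LayerNorm to stabilize training, but LayerNorm hampers mechanistic interpretability, complicates outlier feature suppression, distorts signal propagation, and adds computational cost. The question is therefore how to train a high-performing Transformer without it. Existing approaches, especially models that rely on the GELU activation, degrade noticeably once LayerNorm is removed, which suggests that GELU may be a poor fit for LayerNorm-free architectures.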

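Core idea: The paper looks for an activation function better suited to LayerNorm-free architectures. The authors find that with GELU the early layers experience entropic overload, leaving the representational capacity of the attention heads under-utilized, and they propose ReLU as the replacement. ReLU's geometric properties, namely specialization in input space and intra-class selectivity, help improve learning dynamics and preserve information when LayerNorm is absent.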
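To make "entropic overload" concrete, here is a minimal sketch of how per-head attention entropy could be measured. It assumes the usual Shannon entropy of each softmax attention row; this summary does not give the paper's exact metric, and the function names and toy scores are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_head_entropy(score_rows):
    """Average entropy over the query rows of one head.

    Row i is assumed to hold the causal scores for query i (i + 1 entries).
    """
    entropies = [attention_entropy(softmax(row)) for row in score_rows]
    return sum(entropies) / len(entropies)

# Hypothetical attention scores for two heads over a 4-token context.
diffuse_head = [[0.0], [0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
focused_head = [[0.0], [0.0, 8.0], [0.0, 0.0, 8.0], [8.0, 0.0, 0.0, 0.0]]

max_entropy = math.log(4)  # upper bound for a row of length 4
print(mean_head_entropy(diffuse_head), mean_head_entropy(focused_head), max_entropy)
```

A head whose rows sit near the upper bound spreads attention almost uniformly and carries little selective information, which is the regime the overload description points at.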

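Technical framework: The paper focuses on decoder-only Transformer models and studies how the choice of activation function affects performance. In the experiments, the authors remove LayerNorm and train and evaluate the model with GELU and with ReLU as the activation. The overall architecture otherwise follows a standard Transformer decoder, with multi-head attention and a feed-forward network.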
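The setup described above can be illustrated with a toy residual feed-forward sub-layer that has no LayerNorm and takes the activation as a parameter. This is a sketch under stated assumptions, not the authors' implementation: the attention sub-layer is omitted, biases are dropped, and the weights and helper names are made up for illustration.

```python
import math

def relu(x):
    """ReLU: exact zero for every negative input."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def ffn_sublayer(x, W_in, W_out, act):
    """Residual feed-forward sub-layer with no LayerNorm on its input."""
    hidden = [act(h) for h in matvec(W_in, x)]
    out = matvec(W_out, hidden)
    return [xi + oi for xi, oi in zip(x, out)]

# Hypothetical toy weights: model width 2, hidden width 3.
W_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W_out = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
x = [0.5, -1.0]

print(ffn_sublayer(x, W_in, W_out, relu))
print(ffn_sublayer(x, W_in, W_out, gelu))
```

Passing the activation as an argument keeps everything else fixed, so any difference between the two outputs comes from the activation alone, mirroring the controlled comparison the summary describes.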

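Key innovation: The central finding is that ReLU outperforms GELU in LayerNorm-free Transformer models, which runs against the conventional view that GELU is the better activation for Transformers. The paper also analyzes why GELU underperforms in LayerNorm-free architectures, tracing it to entropic overload in the early layers.

Key design: The main design choices are: 1) removing LayerNorm; 2) using ReLU as the activation function; 3) analyzing the activation distributions of GELU and ReLU at different layers to test the entropic-overload hypothesis; 4) evaluating model performance with perplexity. Parameter settings and network structure otherwise match a standard Transformer decoder, so that the effect of the activation function can be compared directly.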

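📊 Experimental highlights

In LayerNorm-free Transformer models, using ReLU as the activation function reduces perplexity by 8.2% compared with GELU. ReLU is therefore the stronger choice in LayerNorm-free architectures, which offers a new direction for optimizing Transformer designs.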
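For readers less familiar with the metric, the sketch below shows how perplexity and a relative improvement figure are typically computed. It assumes the 8.2% is a relative reduction against the GELU baseline, which this summary does not state explicitly; the per-token losses are hypothetical and are not values from the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def relative_improvement(baseline_ppl, new_ppl):
    """Fractional perplexity reduction relative to the baseline."""
    return (baseline_ppl - new_ppl) / baseline_ppl

# Hypothetical per-token losses, NOT values from the paper.
gelu_nlls = [3.2, 2.9, 3.1, 3.0]
relu_nlls = [3.1, 2.8, 3.0, 2.9]

ppl_gelu = perplexity(gelu_nlls)
ppl_relu = perplexity(relu_nlls)
print(f"GELU ppl={ppl_gelu:.2f}, ReLU ppl={ppl_relu:.2f}, "
      f"improvement={relative_improvement(ppl_gelu, ppl_relu):.1%}")
```

Because perplexity is an exponential of the mean loss, a small drop in average loss translates into a proportionally larger drop in perplexity.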

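🎯 Application scenarios

The results are relevant to settings with strict compute or privacy constraints, such as natural language processing on edge devices or private inference services. Removing LayerNorm and using ReLU can lower computational complexity and communication overhead without a significant loss in model performance, enabling more efficient deployment.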

📄 Abstract (original)

LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for the GELU in transformer-based models, our empirical findings demonstrate an opposite trend: ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU's geometrical properties (specialization in input space and intra-class selectivity) lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges. The code and implementation are available at https://github.com/Nandan91/relu-revival-normfree
