Weight-Decay Turns Transformer Loss Landscapes Villani: Functional-Analytic Foundations for Optimization and Generalization

Authors: Abhijit Das, Sayantan Dutta

Categories: cs.LG, eess.AS

Published: 2026-05-07

Comments: 17 pages, 10 figures

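💡 One-Sentence Takeaway

Proves that weight decay makes the Transformer loss landscape satisfy Villani's criteria, which yields convergence and generalization guarantees.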

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: weight decay, Transformer, loss landscape, functional analysis, optimization theory, deep learning, generalization

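📋 Key Points

Weight decay is standard in Transformer training, but its effect on the loss landscape has lacked theoretical grounding, so its role in optimization has remained unclear.
The paper gives a functional-analytic characterization of the Transformer loss with $L^2$ regularization and links its mathematical properties to optimization behavior.
Experiments support the theory: beyond its empirical generalization benefit, weight decay establishes the conditions required for fast Langevin mixing.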

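📝 Abstract (Summary)

Weight decay is a widely used regularizer for large language models, yet its precise role in shaping Transformer loss landscapes has not been examined in depth theoretically. This paper gives the first rigorous functional-analytic characterization of the standard Transformer objective, cross-entropy loss with $L^2$ regularization, and proves that it satisfies Villani's criteria for coercive energy functions. The authors show that the regularized loss $\mathcal{F}$ is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies a differential growth condition. From this structure they derive explicit log-Sobolev and Poincaré constants that depend on the regularization strength and the model dimension, which give finite-time convergence guarantees for noisy stochastic gradient descent and PAC-Bayesian generalization bounds. Experiments on GPT-Neo-125M validate the theoretical results.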
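
As a short worked step, the quadratic growth and Gaussian-integrable tails can be seen directly under one assumption that is not taken from the paper: that the objective has the common form $\mathcal{F}(\theta) = \mathcal{L}(\theta) + \tfrac{\lambda}{2}\|\theta\|^{2}$ with cross-entropy $\mathcal{L} \geq 0$ (the factor $\tfrac{1}{2}$ is a convention chosen here for illustration). Then:

$$
\mathcal{F}(\theta) \;\geq\; \tfrac{\lambda}{2}\|\theta\|^{2}
\quad\Longrightarrow\quad
\int_{\mathbb{R}^{d}} e^{-\mathcal{F}(\theta)}\,d\theta
\;\leq\; \int_{\mathbb{R}^{d}} e^{-\frac{\lambda}{2}\|\theta\|^{2}}\,d\theta
\;=\; \left(\frac{2\pi}{\lambda}\right)^{d/2} \;<\; \infty .
$$

This sketch covers only the growth and integrability properties. Smoothness and the differential growth condition need the paper's fuller argument and are not reproduced here.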

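🔬 Method Details

Problem definition: The role of weight decay in Transformer loss landscapes has not been adequately studied in theory, and prior work lacks a systematic analysis of its effect.

Core idea: Using functional analysis, the paper gives the first rigorous characterization of cross-entropy loss with $L^2$ regularization and relates its mathematical properties to optimization behavior.

Technical framework: The work analyzes the mathematical properties of the regularized loss, derives log-Sobolev and Poincaré constants, and validates the theory experimentally. Its main components are theoretical derivation, a diagnostic tool, and experimental validation.

Key innovation: The paper is the first to apply Villani's criteria for coercive energy functions to the Transformer loss, showing how weight decay shapes the optimization process.

Key design: The loss includes $L^2$ regularization. The paper introduces a scalable Villani diagnostic $Ψ_s(θ)$ and estimates it efficiently with Hutchinson trace probes, which makes it applicable to models with over 100M parameters. The experiments use GPT-Neo-125M on the Penn Treebank and WikiText-103 datasets.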
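
To make the Hutchinson trace probe concrete, here is a minimal sketch on a toy objective. It is not the authors' implementation. Everything in it is an assumption made for illustration: the parameters are treated directly as logits, so $\mathcal{F}(θ)$ is softmax cross-entropy plus $\tfrac{λ}{2}\|θ\|^2$; the function names are hypothetical; and the Hessian-vector product uses central differences of the gradient as a stand-in for the autodiff products one would use at scale.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of floats."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_F(theta, label, lam):
    """Gradient of the toy loss F = CE(softmax(theta), label) + (lam/2)*||theta||^2."""
    p = softmax(theta)
    return [p[i] - (1.0 if i == label else 0.0) + lam * theta[i]
            for i in range(len(theta))]


def hvp(theta, v, label, lam, eps=1e-4):
    """Hessian-vector product H v via central differences of the gradient."""
    plus = grad_F([t + eps * x for t, x in zip(theta, v)], label, lam)
    minus = grad_F([t - eps * x for t, x in zip(theta, v)], label, lam)
    return [(a - b) / (2.0 * eps) for a, b in zip(plus, minus)]


def hutchinson_laplacian(theta, label, lam, n_probes, rng):
    """Estimate trace(H) = Laplacian of F as the mean of v^T H v over Rademacher v."""
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in theta]
        hv = hvp(theta, v, label, lam)
        total += sum(a * b for a, b in zip(v, hv))
    return total / n_probes


def villani_diagnostic(theta, label, lam, s, n_probes=2000, seed=0):
    """Psi_s(theta) = -Laplacian(F) + ||grad F||^2 / s, with a Hutchinson Laplacian."""
    rng = random.Random(seed)
    lap = hutchinson_laplacian(theta, label, lam, n_probes, rng)
    g = grad_F(theta, label, lam)
    return -lap + sum(x * x for x in g) / s


def exact_laplacian(theta, lam):
    """Closed form for the toy loss: trace(diag(p) - p p^T + lam*I)."""
    p = softmax(theta)
    return 1.0 - sum(x * x for x in p) + lam * len(theta)


if __name__ == "__main__":
    theta = [0.3, -1.2, 0.5, 2.0, -0.7, 0.1, 1.1, -0.4]
    for scale in (1.0, 5.0, 10.0):
        scaled = [scale * t for t in theta]
        print(scale, villani_diagnostic(scaled, label=3, lam=0.1, s=1.0))
```

Each probe costs one Hessian-vector product, so the estimator never forms the full Hessian. The toy loss has a closed-form Laplacian, which makes it easy to check the estimate before swapping in a real model's gradient function.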

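📊 Experimental Highlights

Experiments confirm the predicted quadratic growth of $Ψ_s$ and the spectral inflation of the Hessian, supporting the theoretical analysis. On GPT-Neo-125M, the observed exponential convergence behavior is consistent with the log-Sobolev analysis, and weight decay is associated with better generalization.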

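🎯 Application Scenarios

Potential applications include natural language processing and improving the optimization and generalization of deep learning models. By combining theoretical analysis with experimental validation, the work offers a new perspective on training large language models and may inform more efficient model design.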

📄 Abstract (Original)

Weight decay is widely used as a regularizer in large language models, yet its precise role in shaping Transformer loss landscapes remains theoretically underexplored. This paper provides the first rigorous functional-analytic characterization of the standard Transformer objective--cross-entropy loss with $L^2$ regularization--by proving it satisfies Villani's criteria for coercive energy functions. Specifically, we show that the regularized loss $\mathcal{F}$ is infinitely differentiable, grows at least quadratically, has Gaussian-integrable tails, and satisfies the differential growth condition $-Δ\mathcal{F} + \tfrac{1}{s}\|\nabla\mathcal{F}\|^{2} \to \infty$ as $\|θ\| \to \infty$ for all $s>0$. From this structure, we derive explicit log-Sobolev and Poincaré constants $C_{\mathrm{LS}} \leq λ^{-1} + d/λ^{2}$, linking the regularization strength $λ$ and model dimension $d$ to finite-time convergence guarantees for noisy stochastic gradient descent and PAC-Bayesian generalization bounds that tighten with increasing $λ$. To validate our theory, we introduce a scalable Villani diagnostic $Ψ_s(θ) = -Δ\mathcal{F} + s^{-1}\|\nabla \mathcal{F}\|^2$ and estimate it efficiently using Hutchinson trace probes in models with over 100M parameters. Experiments on GPT-Neo-125M across Penn Treebank and WikiText-103 confirm the predicted quadratic growth of $Ψ_s$, spectral inflation of the Hessian, and exponential convergence behavior consistent with our log-Sobolev analysis. These results demonstrate that weight decay not only improves generalization empirically but also establishes the mathematical conditions required for fast Langevin mixing and theoretically grounded curvature-aware optimization in deep learning.
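
To see how the stated bound $C_{\mathrm{LS}} \leq λ^{-1} + d/λ^{2}$ scales, the arithmetic can be sketched in a few lines. The function name and the parameter count `d = 125_000_000` are illustrative assumptions, not values taken from the paper's experiments.

```python
def log_sobolev_bound(lam: float, d: int) -> float:
    """Upper bound 1/lam + d/lam^2 on C_LS, as stated in the abstract."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return 1.0 / lam + d / lam ** 2


if __name__ == "__main__":
    d = 125_000_000  # illustrative parameter count (assumption)
    for lam in (1e-3, 1e-2, 1e-1):
        print(f"lambda={lam:g}  bound={log_sobolev_bound(lam, d):.3e}")
```

For large $d$ the $d/λ^{2}$ term dominates, so a tenfold increase in $λ$ shrinks the bound by roughly a factor of 100.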
