Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

📄 arXiv: 2605.07182v1

Authors: Ali Taghibakhshi, Ruisi Cai, Saurav Muralidharan, Sharath Turuvekere Sreenivas, Aditya Vavre, Ameya Sunil Mahabaleshwarkar, Bilal Kartal, Sheldon Liang, Marcin Chochowski, Zijia Chen, Akhiad Bercovich, Ran Zilberstein, Ran El-Yaniv, Yonatan Geifman, Daniel Korzekwa, Yoshi Suhara, Oluwatobi Olabiyi, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov

Categories: cs.LG

Published: 2026-05-08


💡 One-Sentence Summary

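Proposes Star Elastic, a post-training framework that adds nested submodels to a parent model in a single post-training run and supports dynamic budget control at inference time.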

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

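Keywords: large language models, model compression, knowledge distillation, mixture of experts, elastic inference, nested models, compute efficiency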

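📋 Key Points

  1. Training a family of LLMs currently requires a separate run for each model, which wastes compute, lengthens training cycles, and makes it hard to serve varied deployment needs.
  2. Star Elastic uses a nested architecture to produce multiple submodels in a single post-training run, and introduces elastic budget control so that compute can be allocated dynamically at inference time.
  3. On the Nemotron Nano models, the method reports a 360x reduction in training cost versus pretraining from scratch, together with up to 16% higher accuracy and 1.9x lower latency.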

📝 Abstract (Translated Summary)

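Training a family of large language models (LLMs), whether from scratch or via iterative compression, is expensive and inefficient because every model in the family needs its own training run. This paper introduces Star Elastic, an LLM post-training method that adds N nested submodels to a given parent reasoning model through a single post-training job, using the compute of one run (N-fold savings). Beyond lowering training cost, Star Elastic addresses a core limitation of efficient reasoning: a static architecture allocates the same resources regardless of token difficulty. By unlocking elastic budget control, Star Elastic allows different submodels to be used in different reasoning phases (thinking and answering). The method supports nesting along the SSM, embedding channel, MoE, and FFN axes, and combines an end-to-end trainable router with curriculum-based knowledge distillation. In experiments on the NVIDIA Nemotron Nano models, the nested models match or outperform independently trained baselines of comparable size, training cost is 360x lower than pretraining from scratch, and the accuracy-latency Pareto frontier improves.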

🔬 Method Details

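Problem definition: Existing LLM training practice trains or compresses each model size separately, which wastes a large amount of compute. In addition, a static model architecture cannot adjust its compute budget to the difficulty of the work at hand (for example, the thinking phase versus the answering phase), which limits inference efficiency.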

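Core idea: Star Elastic is built on nested training: several submodels are embedded inside one parent model and share its weights. Curriculum-based knowledge distillation lets the submodels inherit the parent's reasoning ability, so a single training run covers several model sizes.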
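
The weight-sharing idea can be illustrated with a minimal sketch along the FFN axis. It assumes a submodel is obtained by keeping the first `hidden` hidden units of the parent's FFN weights; this prefix-slicing rule, the helper names (`matvec`, `ffn`), and the toy weights are all illustrative assumptions, since the summary does not say how Star Elastic orders or selects units.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def ffn(w_in: Matrix, w_out: Matrix, x: Vector, hidden: int) -> Vector:
    """Run a two-layer FFN using only the first `hidden` hidden units.

    w_in has shape [hidden_full][d_model]; w_out has shape [d_model][hidden_full].
    Slicing (an assumed nesting rule) reuses the parent's weights without copying
    or retraining, so every submodel lives inside the parent checkpoint.
    """
    h = relu(matvec(w_in[:hidden], x))
    return matvec([row[:hidden] for row in w_out], h)


# Toy parent with d_model = 2 and 4 hidden units (hypothetical values).
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_OUT = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
x = [1.0, 2.0]

parent_out = ffn(W_IN, W_OUT, x, hidden=4)  # full parent FFN
child_out = ffn(W_IN, W_OUT, x, hidden=2)   # nested submodel, same weights
```

The two calls read from the same `W_IN` and `W_OUT`, which is the property that lets one checkpoint serve several model sizes; the outputs differ because the smaller slice drops hidden units, and distillation is what is meant to close that gap.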

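Technical framework: The method builds on the Nemotron Elastic framework and supports nesting along several axes: SSM, embedding channel, MoE, and FFN. The nested submodels are learned through an end-to-end trainable router. At inference time, a submodel can be selected per reasoning phase, and Quantization-Aware Distillation (QAD) extends the approach to quantized checkpoints with smaller deployment footprints.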

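Key innovation: The main novelty is elastic budget control at inference time: the model can switch between submodels of different sizes for the thinking and answering phases, instead of spending a constant amount of compute on every token as a static architecture does.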
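
Per-phase model selection can be sketched as a small controller that tracks the current reasoning phase and picks a submodel accordingly. Everything here is a hypothetical illustration: the `</think>` marker, the `PhaseBudget` type, the `assign_submodels` helper, and the choice of which size serves which phase are assumptions, not details taken from the paper. The sketch labels an existing token stream rather than running real decoding.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class PhaseBudget:
    """Which nested submodel to use in each reasoning phase (assumed names)."""
    thinking: str
    answering: str


def assign_submodels(
    tokens: Iterable[str],
    budget: PhaseBudget,
    end_of_thinking: str = "</think>",  # assumed phase delimiter
) -> List[Tuple[str, str]]:
    """Pair each token with the submodel that would be active when it is produced."""
    phase = "thinking"
    plan: List[Tuple[str, str]] = []
    for tok in tokens:
        model = budget.thinking if phase == "thinking" else budget.answering
        plan.append((tok, model))
        # The delimiter itself belongs to the thinking phase; switch afterwards.
        if tok == end_of_thinking:
            phase = "answering"
    return plan


tokens = ["<think>", "2+2", "=4", "</think>", "The", "answer", "is", "4"]
# Arbitrary example assignment: smaller submodel thinks, parent answers.
budget = PhaseBudget(thinking="12B", answering="30B")
plan = assign_submodels(tokens, budget)
```

Because the submodels are nested in one checkpoint, switching between them in a scheme like this would not require loading a second set of weights; swapping the two fields of `PhaseBudget` gives the opposite allocation.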

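Key design: Knowledge distillation follows a curriculum, which is intended to keep the submodels accurate despite their reduced parameter counts. Quantization-Aware Distillation (QAD) produces nested NVFP4 and FP8 elastic checkpoints that preserve zero-shot slicing while delivering smaller deployment footprints.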
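
The distillation term can be made concrete with a minimal sketch of a standard knowledge-distillation loss, the forward KL divergence between the teacher's (parent's) and student's (submodel's) softmax distributions. This is the generic textbook formulation, offered as an assumption about what the loss could look like; the summary does not give Star Elastic's exact objective or its curriculum schedule, so the curriculum is not modeled here. The function names and the `temperature` parameter are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) over one token's vocabulary distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])      # student matches teacher
different = kd_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # student disagrees
```

The loss is zero when the submodel reproduces the parent's distribution and grows as they diverge; in a nested setup one such term would typically be computed for each sampled submodel size.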

📊 Experimental Highlights

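Starting from Nemotron Nano v3 (30B/3.6A), Star Elastic generates 23B (2.8A) and 12B (2.0A) variants with 160B training tokens. All nested models match or outperform independently trained baselines of comparable size. Training cost is 360x lower than pretraining from scratch and 7x lower than state-of-the-art compression. At inference time, dynamic per-phase model selection achieves up to 16% higher accuracy and 1.9x lower latency, advancing the accuracy-latency Pareto frontier.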

🎯 Application Scenarios

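The technique suits resource-constrained edge devices, real-time inference systems, and complex reasoning workloads whose compute cost should track task difficulty. Through elastic model selection, it could improve the response speed and accuracy of assistants, code generation, and long chain-of-thought reasoning tasks, which makes it attractive for industrial deployment.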

📄 Abstract (Original)

Training a family of large language models (LLMs), either from scratch or via iterative compression, is prohibitively expensive and inefficient, requiring separate training runs for each model in the family. In this paper, we introduce Star Elastic, a novel LLM post-training method that adds N nested submodels to a given parent reasoning model using the compute of one run (N-fold savings) via a single post-training job. Beyond reducing training costs, Star Elastic also addresses a fundamental limitation of efficient reasoning: the rigidity of static architectures, which forces the allocation of constant resources regardless of token difficulty. By unlocking elastic budget control, Star Elastic enables a novel inference scheme that uses different submodels for each reasoning phase (thinking and answering). Star Elastic supports (1) nesting along the SSM, embedding channel, MoE, and FFN axes, (2) learning nested submodels via an end-to-end trainable router, and (3) curriculum-based knowledge distillation. Building on the Nemotron Elastic framework, we apply Star Elastic to the NVIDIA Nemotron Nano models, with a particular focus on hybrid Mixture-of-Experts (MoE) architectures: from Nemotron Nano v3 (30B/3.6A), we generate 23B (2.8A) and 12B (2.0A) variants with 160B training tokens. All nested models match or outperform independently trained baselines of comparable size and achieve a 360x reduction versus pretraining from scratch and a 7x reduction over state-of-the-art compression. Crucially, elastic budget control advances the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9x lower latency via dynamic per-phase model selection. We further extend Star Elastic to quantized regimes via Quantization-Aware Distillation (QAD), producing nested NVFP4 and FP8 elastic checkpoints that preserve zero-shot slicing while delivering smaller deployment footprints.