RAST: Reasoning Activation in LLMs via Small-model Transfer

📄 arXiv: 2506.15710v1 📥 PDF

Authors: Siru Ouyang, Xinyu Zhu, Zilin Xiao, Minhao Jiang, Yu Meng, Jiawei Han

Categories: cs.LG, cs.AI

Published: 2025-05-30

🔗 Code/Project: PROJECT_PAGE


💡 One-Sentence Takeaway

Proposes RAST, an efficient way to activate the reasoning capabilities of large language models without RL-training them directly.

🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: reinforcement learning · large language models · reasoning · model transfer · small-model training · mathematical reasoning · GPU efficiency

📋 Key Points

  1. Existing RL methods are hard to apply at scale: they are highly resource-intensive and require maintaining multiple model copies.
  2. This paper proposes RAST, which runs RL on a small model and transfers the induced probability adjustments to a larger model, thereby improving the larger model's reasoning ability.
  3. Experiments show that RAST substantially improves base-model reasoning on multiple mathematical reasoning benchmarks while requiring far less GPU memory than direct RL training.

📝 Abstract (translated)

Reinforcement learning (RL) has become an effective way to improve the reasoning capabilities of large language models (LLMs), but applying it at scale remains highly resource-intensive. Prior work suggests that RL does not endow a model with new knowledge; rather, it reshapes the output distribution to activate reasoning abilities already latent in the base model. Building on this insight, the paper proposes RAST, which transfers the probability adjustments induced by RL training of a small model to a larger model, efficiently improving the larger model's reasoning ability. Experimental results show that RAST substantially strengthens base models across multiple mathematical reasoning benchmarks while requiring significantly less GPU memory than direct RL training, and in some cases even outperforms the RL-trained counterparts.

🔬 Method Details

Problem definition: This work targets the high resource cost of using RL to improve LLM reasoning, and in particular the inefficiency of running RL training directly at large model scales.

Core idea: Use RL training on a small model to activate the reasoning ability latent in a larger model, by transferring the small model's RL-induced probability adjustments to the larger model.

Technical framework: RAST consists of two main stages: first, train a small model with RL; then, inject the resulting shifts in its output probabilities into a larger base model.

Key innovation: RAST introduces a new cross-scale transfer mechanism that reuses the outcome of RL training on a small model to improve a larger model's reasoning, which is fundamentally different from the conventional approach of running RL directly on the large model.

Key design: The design centers on the small model's RL training process and on precisely measuring its output-probability adjustments so that they transfer effectively to the large model; the loss function and hyperparameter settings are also tuned to improve reasoning performance.
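The transfer mechanism can be sketched at the level of next-token logits. The sketch below is illustrative only: the function names and the toy four-token vocabulary are assumptions, not from the paper, and it assumes the "probability adjustment" is the logit difference between the RL-trained small model and its base, added to the large model's logits before sampling.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rast_next_token_probs(logits_large, logits_small_rl, logits_small_base):
    """Shift the large model's logits by the RL-induced delta measured
    on the small model pair, then renormalize (hypothetical sketch)."""
    delta = [rl - base for rl, base in zip(logits_small_rl, logits_small_base)]
    shifted = [l + d for l, d in zip(logits_large, delta)]
    return softmax(shifted)

# Toy 4-token vocabulary; RL boosted token 1 in the small model.
logits_large = [2.0, 1.0, 0.5, 0.0]
logits_small_base = [1.5, 1.0, 0.2, 0.1]
logits_small_rl = [1.5, 2.5, 0.2, 0.1]

probs = rast_next_token_probs(logits_large, logits_small_rl, logits_small_base)
```

Under this sketch, only the small model ever sees an RL update; the large model is used purely at inference time, which is where the GPU-memory savings reported in the paper would come from.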

🖼️ Key Figures

fig_0
fig_1
fig_2

📊 Experimental Highlights

RAST substantially improves base-model reasoning across multiple mathematical reasoning benchmarks, requires far less GPU memory than direct RL training, and in some cases even outperforms the RL-trained models, demonstrating its efficiency and practicality.

🎯 Application Scenarios

Potential application areas include natural language processing, intelligent question-answering systems, and automated reasoning. By strengthening LLM reasoning at low cost, RAST could enable more accurate and efficient AI services in fields such as education, financial analysis, and scientific research, and may help drive broader real-world deployment of AI applications.

📄 Abstract (original)

Reinforcement learning (RL) has become a powerful approach for improving the reasoning capabilities of large language models (LLMs), as evidenced by recent successes such as OpenAI's o1 and Deepseek-R1. However, applying RL at scale remains intimidatingly resource-intensive, requiring multiple model copies and extensive GPU workloads. On the other hand, while being powerful, recent studies suggest that RL does not fundamentally endow models with new knowledge; rather, it primarily reshapes the model's output distribution to activate reasoning capabilities latent in the base model. Building on this insight, we hypothesize that the changes in output probabilities induced by RL are largely model-size invariant, opening the door to a more efficient paradigm: training a small model with RL and transferring its induced probability shifts to larger base models. To verify our hypothesis, we conduct a token-level analysis of decoding trajectories and find high alignment in RL-induced output distributions across model scales, validating our hypothesis. Motivated by this, we propose RAST, a simple yet effective method that transfers reasoning behaviors by injecting RL-induced probability adjustments from a small RL-trained model into larger models. Experiments across multiple mathematical reasoning benchmarks show that RAST substantially and consistently enhances the reasoning capabilities of base models while requiring significantly lower GPU memory than direct RL training, sometimes even yielding better performance than the RL-trained counterparts. Our findings offer new insights into the nature of RL-driven reasoning and practical strategies for scaling its benefits without incurring its full computational cost. The project page of RAST is available at https://ozyyshr.github.io/RAST/.