You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
Authors: Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng
Categories: cs.LG, cs.CL
Published: 2026-05-20
Comments: preprint. Code: https://github.com/weizhepei/RELEX
🔗 Code/Project: GitHub
💡 One-Sentence Takeaway
Proposes RELEX, a method that extrapolates future RLVR checkpoints from a short prefix of training.
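🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models
Keywords: reinforcement learning, verifiable rewards, parameter trajectories, extrapolation, low-rank approximation, linear regression, natural language processing, model optimization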
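📋 Key Points
- The geometry of RLVR parameter trajectories remains underexplored, so their potential for extrapolation has gone largely untapped.
- RELEX estimates a rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required.
- Across three models, RELEX matches or exceeds RLVR performance while using as few as 15% of the full RLVR training steps.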
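📝 Abstract (translated summary)
Reinforcement learning with verifiable rewards (RLVR) has become a dominant approach for improving the reasoning ability of large language models (LLMs), yet the geometry of its parameter trajectories remains underexplored. This paper shows that RLVR weight trajectories are low-rank: most of the downstream performance gain is captured by a rank-1 approximation of the parameter deltas. Building on this, the authors propose RELEX, a simple and compute-efficient method that estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression. Experiments show that RELEX matches or exceeds RLVR performance across three models while requiring as few as 15% of the full RLVR training steps. The authors attribute RELEX's success to a "denoising" effect: projecting updates onto the rank-1 subspace discards stochastic optimization noise.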
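The method summarized above can be written compactly as follows. The symbols are notation assumed here for illustration and are not taken from the paper, which may define the subspace per weight matrix rather than over a flattened parameter vector.

$$
\Delta_t = \theta_t - \theta_0, \qquad t \in \{t_1, \dots, t_K\} \ \text{(observation window)}
$$

$$
\Delta_t \approx c_t\, v, \qquad c_t = \langle \Delta_t, v \rangle, \qquad \|v\|_2 = 1
$$

$$
c_t \approx a\,t + b, \qquad \hat{\theta}_T = \theta_0 + (a\,T + b)\, v \quad \text{for } T > t_K
$$

Here $v$ is the estimated rank-1 direction, $c_t$ is the scalar magnitude of the projection at step $t$, and $a$, $b$ are the slope and intercept fitted by linear regression on the observed steps.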
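🔬 Method Details
Problem definition: The geometry of RLVR parameter trajectories is underexplored, so standard RLVR training does not exploit it to predict future checkpoints.
Core idea: RLVR weight trajectories are extremely low-rank, and the magnitude of their rank-1 projection evolves near-linearly with training steps, so future checkpoints can be extrapolated along that rank-1 direction.
Technical framework: RELEX (1) collects parameter deltas over a short observation window, (2) estimates the rank-1 subspace from them, and (3) extrapolates future checkpoints via linear regression.
Key innovation: Extrapolation requires no learned model and incurs no additional training cost, and the rank-1 projection suppresses stochastic optimization noise.
Key design: The main design choices are the length of the observation window and the linear regression used for extrapolation. According to the ablations, neither a higher subspace rank nor non-linear modeling yields further extrapolation gains.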
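The three steps of the framework can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats each checkpoint as a single flattened parameter vector, takes the rank-1 direction from an SVD of the stacked deltas, and fits the projection magnitude with ordinary least squares. The function name `relex_extrapolate` and its arguments are hypothetical.

```python
import numpy as np

def relex_extrapolate(checkpoints, steps, target_step):
    """Extrapolate a flattened parameter vector to `target_step`.

    checkpoints: array of shape (T, d); row 0 is the initial model theta_0
    steps:       array of shape (T,); training step of each row
    Assumption: a single global rank-1 direction over the flattened weights.
    """
    checkpoints = np.asarray(checkpoints, dtype=np.float64)
    steps = np.asarray(steps, dtype=np.float64)
    theta0 = checkpoints[0]

    # Step 1: parameter deltas relative to the initial checkpoint.
    deltas = checkpoints[1:] - theta0

    # Step 2: rank-1 subspace = top right-singular vector of the deltas.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]

    # Scalar coordinate of every observed delta along that direction.
    coeffs = deltas @ direction

    # Step 3: least-squares line, coeff ~ slope * step + intercept.
    slope, intercept = np.polyfit(steps[1:], coeffs, deg=1)

    # Predicted checkpoint at the unobserved target step.
    return theta0 + (slope * target_step + intercept) * direction
```

The sign of a singular vector is arbitrary, but the projection coefficients flip with it, so the extrapolated checkpoint is unaffected. For a real model one would call this with checkpoints saved during the observation window and load the returned vector back into the model's parameters.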
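📊 Experimental Highlights
Across Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks while requiring as few as 15% of the full RLVR training steps. It can also predict checkpoints up to 10-20× beyond the observed prefix with continued improvement (e.g., observing only the first 50 steps and extrapolating to 1000 steps).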
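The two ratios quoted above are easy to conflate, so the arithmetic is spelled out here. The helper names are hypothetical, and the 150-of-1000 case is only an illustration of what a 15% prefix would mean for a run of that length, not a configuration reported in the paper.

```python
def extrapolation_factor(observed_steps: int, target_step: int) -> float:
    """How far past the observed prefix the predicted checkpoint lies."""
    return target_step / observed_steps

def observed_fraction(observed_steps: int, full_steps: int) -> float:
    """Share of a full RLVR run that is actually trained and observed."""
    return observed_steps / full_steps

# The example from the abstract: observe 50 steps, extrapolate to 1000.
print(extrapolation_factor(50, 1000))  # prints 20.0

# Illustrative only: a 15% prefix of a hypothetical 1000-step run.
print(observed_fraction(150, 1000))
```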
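🎯 Application Scenarios
Potential application areas include natural language processing, dialogue systems, and question answering. Because RELEX needs only a short prefix of RLVR training, it could be valuable in settings that require fast iteration on RLVR-trained models, and it may encourage further work on RLVR-based model development.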
📄 Abstract (Original)
Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, where the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (i.e., Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks, requiring as few as 15% steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20$\times$ beyond the observed prefix with continued improvement (e.g., observe only the first 50 steps and extrapolate to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the model discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
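The "denoising" claim can be illustrated on synthetic data. The toy below is an assumption-laden sketch, not an experiment from the paper: it builds deltas that drift linearly along one direction, adds isotropic Gaussian noise as a stand-in for stochastic optimization noise, and compares how far the raw and the rank-1 projected deltas are from the noise-free trajectory. All names and constants are hypothetical.

```python
import numpy as np

def rank1_denoise(deltas):
    """Project each row of `deltas` onto the top right-singular vector."""
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]
    return np.outer(deltas @ direction, direction)

def make_toy_trajectory(num_steps=50, dim=512, drift=0.05, noise=0.02, seed=0):
    """Linear drift along one unit direction plus isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)
    steps = np.arange(1, num_steps + 1)
    clean = np.outer(drift * steps, u)  # noise-free deltas, exactly rank 1
    noisy = clean + noise * rng.normal(size=clean.shape)
    return clean, noisy

clean, noisy = make_toy_trajectory()
raw_err = np.linalg.norm(noisy - clean)
proj_err = np.linalg.norm(rank1_denoise(noisy) - clean)
print(f"raw error: {raw_err:.3f}, rank-1 projected error: {proj_err:.3f}")
```

In this toy setting the projection removes most of the noise because the noise is spread over all dimensions while the signal lives in one. Whether real RLVR noise behaves similarly is the paper's empirical claim, which this sketch does not test.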