Stage-1 Controls the Entropy Regime, Not the Outcome

Author: Jianxiong Shen

Categories: cs.LG, cs.AI, cs.CV

Published: 2026-06-08

💡 One-Sentence Takeaway

Examines how Stage-1 shapes the entropy regime rather than controlling the final outcome.

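🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: two-stage training, vision-language models, policy entropy, reinforcement learning, small datasets, supervised fine-tuning, on-policy distillation, model optimization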

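📋 Key Points

How two-stage post-training affects model performance on small datasets remains unclear, particularly the role of Stage-1.
The paper compares supervised fine-tuning (SFT) with on-policy distillation (OPD) to examine how Stage-1 affects the model's entropy regime and the subsequent reinforcement learning stage.
Experiments show that Stage-1 has limited effect on the final results, while policy entropy and answer diversity differ clearly between initializations.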

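📝 Abstract (translated from Chinese)

This study examines what Stage-1 actually controls in two-stage post-training, through small-data experiments with Qwen2.5-VL-7B. The results show that different Stage-1 initializations have limited effect on geometry-task performance but differ markedly in entropy regime. Comparing supervised fine-tuning (SFT) with on-policy distillation (OPD), the study finds that OPD enters the reinforcement learning (RL) stage with higher policy entropy and answer diversity, yet the final downstream gains are small. The work provides empirical evidence on the role of each training stage.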

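🔬 Method Details

Problem definition: The study asks what Stage-1 actually controls in two-stage post-training, especially on small datasets, where prior work has not made the effect of Stage-1 on model performance clear.

Core idea: Compare different initialization methods (supervised fine-tuning and on-policy distillation) and analyze how Stage-1 affects the model's entropy regime and its behavior during the subsequent reinforcement learning.

Technical framework: The study uses Qwen2.5-VL-7B with two-stage training: Stage-1 is a warm-start (SFT, or OPD with a same-modality 72B VLM teacher) and Stage-2 is reinforcement learning. Experiments are evaluated on Geometry3K and MathVista.

Key innovation: The study shows a strong association between Stage-1 and the entropy regime, even though the practical downstream payoff is small, giving a new empirical account of what each training stage contributes.

Key design: The experiments use different initialization strategies with corresponding hyperparameters, and apply early stopping to SFT training so that model performance is comparable across stages. They also track policy entropy and answer diversity as key metrics.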
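The two metrics named above, policy entropy and answer diversity, can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not say how either metric is computed, so the sketch assumes policy entropy is the mean Shannon entropy of the per-token softmax distribution and answer diversity is the fraction of distinct final answers among the samples for one problem. The function names `token_entropy`, `mean_policy_entropy`, and `answer_diversity` are hypothetical.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) at one token position."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Skip zero-probability entries: p * log(p) tends to 0 as p tends to 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(logits_per_token):
    """Average per-token entropy over a generated sequence (assumed definition)."""
    return sum(token_entropy(l) for l in logits_per_token) / len(logits_per_token)


def answer_diversity(answers):
    """Fraction of distinct final answers among the samples (assumed definition)."""
    return len(set(answers)) / len(answers)
```

Under these assumed definitions, a warm-start that enters RL with flatter next-token distributions would show a higher `mean_policy_entropy`, and one whose samples spread over more final answers would show a higher `answer_diversity`.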

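📊 Experimental Highlights

OPD enters the reinforcement learning stage with substantially higher policy entropy than SFT, along with higher answer diversity. The gain in final task performance is limited, however: on MathVista the six models fall within 1.2 points of each other, indicating that Stage-1 has little effect on the outcome.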

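🎯 Application Scenarios

The findings are relevant to training and optimizing vision-language models, especially in small-data settings. Understanding what each training stage contributes can help improve learning efficiency and generalization, and may inform future work on multimodal learning and reinforcement learning.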

📄 Abstract (Original)

Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$--$54\%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the \emph{entropy regime}: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.
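The abstract reports pass@16 differences together with problem-level bootstrap intervals. As a hedged illustration only, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, and a paired percentile bootstrap that resamples problems. The abstract does not specify the estimator or the bootstrap variant, so both are assumptions, and the names `pass_at_k` and `bootstrap_diff_ci` are hypothetical.

```python
import random
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def bootstrap_diff_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Paired problem-level percentile bootstrap CI for mean(a) - mean(b).

    scores_a[i] and scores_b[i] are per-problem scores (e.g. pass@k) of two
    models on the same problem i; problems are resampled with replacement.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same problems")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

With n = k = 16, the estimator reduces to whether at least one of the 16 samples is correct. An interval from `bootstrap_diff_ci` that contains 0 corresponds to the kind of uncertain contrast the abstract describes for the smaller pass@16 difference.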
