Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models

Author: Igor Strozzi

Category: cs.LG

Published: 2026-05-12

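💡 One-sentence takeaway

Measures procedural-skill SFT on Qwen3.5 models and finds that a W-shaped pre-SFT base trajectory, rather than SFT itself, drives the variation across scales.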

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: procedural skills, SFT, W-shaped trajectory, Qwen3.5, model evaluation, natural language processing, intelligent education

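📋 Key points

Existing methods improve procedural skills unevenly across model scales and lack a systematic evaluation mechanism.
The paper identifies a W-shaped pre-SFT base trajectory and, by comparing models of different scales, uses it to explain how procedural-skill SFT behaves.
SFT-attributable lifts are +0.070, +0.040, and +0.075 at 0.8B, 2B, and 4B respectively, which is roughly uniform across scales.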

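📝 Abstract (translated from Chinese)

This paper measures the contribution of procedural-skill SFT at three dense Qwen3.5 scales (0.8B, 2B, 4B) on a holdout set of 200 tasks and 40 skills, with Claude Haiku 4.5 as a reference. The main finding is that, under matched-path LLM scoring, the SFT-attributable procedural-skill lift is roughly uniform across scales. A W-shaped pre-SFT trajectory dominates the variation in post-SFT results, and the proposed mechanism shows an asymmetric pattern across model scales with a falsifiable prediction. The methodology includes an analysis of a format-compliance artifact in the benchmark and a study of a negative-iteration sequence, both of which support the reliability and consistency of the results.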

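🔬 Method details

Problem definition: The paper addresses how the contribution of procedural-skill SFT differs across Qwen3.5 models of different scales. Existing methods lack an in-depth analysis of the relationship between model scale and SFT effect.

Core idea: The paper identifies a W-shaped pre-SFT base trajectory and uses it to explain the variation in post-SFT results. It proposes a regime-asymmetric mechanism in which SFT works hardest, in absolute terms, where the base model struggles with the procedure.

Technical framework: The study uses a holdout set of 200 tasks and 40 skills, combined with an analysis of the benchmark's format-compliance artifact and a study of a negative-iteration sequence.

Key innovation: The main contribution is the W-shaped pre-SFT trajectory, which reveals an asymmetric pattern in procedural-skill gains across model scales and gives a more systematic evaluation than existing methods.

Key design: 83.5% of the holdout is scored with a deterministic ANSWER-line extractor, which under-counts free-form conclusions; an LLM-only re-judge corrects for this bias. In the negative-iteration sequence at 0.8B, five recipe variants cluster within a 2 pp band of post-SFT pass rate, which indicates that the absolute pass-rate ceiling is set by base capacity rather than by the recipe.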
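The format-compliance artifact can be illustrated with a minimal sketch of a deterministic extractor. The paper's actual extractor is not shown in this summary, so the `ANSWER:` line format, the function names, and the sample responses below are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed format (hypothetical): a line that starts with "ANSWER:" followed by the answer.
ANSWER_LINE = re.compile(r"^ANSWER:\s*(.+?)\s*$", re.MULTILINE)


def extract_answer(response: str) -> Optional[str]:
    """Return the text of the last ANSWER line, or None if no such line exists."""
    matches = ANSWER_LINE.findall(response)
    return matches[-1] if matches else None


def deterministic_pass(response: str, expected: str) -> bool:
    """Pass only when an ANSWER line exists and matches the expected string."""
    answer = extract_answer(response)
    return answer is not None and answer.lower() == expected.strip().lower()


# Two made-up responses that reach the same correct conclusion.
compliant = "Step 1: add the terms.\nANSWER: 42"
free_form = "Step 1: add the terms.\nSo the result is 42."

print(deterministic_pass(compliant, "42"))  # True
print(deterministic_pass(free_form, "42"))  # False, although the conclusion is correct
```

A response that states the right conclusion in free form fails this kind of check, which is the under-counting that an LLM-only re-judge is meant to remove.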

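📊 Experimental highlights

SFT-attributable lifts are +0.070, +0.040, and +0.075 at 0.8B, 2B, and 4B respectively. The W-shaped pre-SFT trajectory shows up at 0.8B and 4B, where the 5-step procedure hurts the base model, whereas it helps at 2B. This is consistent with the proposed mechanism.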
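The lift figures can be reproduced from the pre-SFT and post-SFT Δ values reported in the original abstract. The sketch below assumes that the SFT-attributable lift is simply post-SFT Δ minus pre-SFT Δ at each scale; that reading matches the reported numbers, but the variable and function names are illustrative only.

```python
# Values copied from the original abstract.
SCALES = ["0.8B", "2B", "4B"]
pre_sft_delta = {"0.8B": -0.075, "2B": 0.060, "4B": -0.010}
post_sft_delta = {"0.8B": -0.005, "2B": 0.100, "4B": 0.065}


def sft_lift(scale: str) -> float:
    """Assumed definition: lift = post-SFT delta minus pre-SFT delta."""
    return round(post_sft_delta[scale] - pre_sft_delta[scale], 3)


lifts = {scale: sft_lift(scale) for scale in SCALES}
print(lifts)  # {'0.8B': 0.07, '2B': 0.04, '4B': 0.075}
```

Under this reading, the lifts vary far less across scales than the pre-SFT values do, which is why the variation in post-SFT Δ is attributed to the base trajectory.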

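🎯 Application scenarios

Potential application areas include natural language processing, intelligent education systems, and human-computer interaction. Optimizing procedural-skill SFT could improve model performance on complex tasks.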

📄 Abstract (original)

We measure procedural-skill SFT contribution across three Qwen3.5 dense scales (0.8B, 2B, 4B) on a 200-task / 40-skill holdout, with Claude Haiku 4.5 as a frontier reference. The corpus is 353 rows of (task + procedural-skill block, Opus chain-of-thought, judge-pass) demonstrations. Main finding. Under matched-path LLM-only scoring, the SFT-attributable procedural-Δ lift is roughly uniform across sizes: +0.070 / +0.040 / +0.075 at 0.8B / 2B / 4B. Variation in post-SFT Δ (-0.005, +0.100, +0.065) is dominated by a W-shaped pre-SFT base trajectory (-0.075, +0.060, -0.010, Haiku-4-5 at +0.030): the 5-step procedure hurts 0.8B and 4B, helps 2B, and helps frontier Haiku modestly. SFT works hardest in absolute terms where the base struggles with the procedure -- a regime-asymmetric pattern with a falsifiable prediction at 8B/14B. Methodology. (i) A bench format-compliance artifact: 83.5% of the holdout uses a deterministic ANSWER-line extractor that under-counts free-form conclusions; an LLM-only re-judge reveals it was systematically biased against \CU. (ii) A negative-iteration sequence at 0.8B: five recipe variants cluster post-SFT \CU pass-rate within a 2 pp band, constraining the absolute-pass-rate ceiling to base capacity rather than recipe. Cross-family validation. GPT-5.4 via OpenRouter on all 7 configurations (2800 paired episodes) agrees on the direction of every per-student finding: Cohen's κ ≥ 0.754, agreement ≥ 93.25%. Earlier "format-only at 0.8B" and "shrinking SFT at 4B" framings were path-mismatch artifacts; this paper supersedes both (Appendix \ref{sec:appendix-path}). Single-seed; threats in §\ref{sec:threats}.
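The cross-family agreement statistics can be made concrete with a minimal sketch of Cohen's κ for two binary judges. The verdict lists below are toy data, not the paper's episodes, and the function name is hypothetical; the formula is the standard κ = (p_o - p_e) / (1 - p_e).

```python
from typing import Sequence


def cohens_kappa(judge_a: Sequence[int], judge_b: Sequence[int]) -> float:
    """Cohen's kappa for two judges giving binary pass (1) / fail (0) verdicts."""
    if len(judge_a) != len(judge_b) or len(judge_a) == 0:
        raise ValueError("both judges must score the same non-empty set of episodes")
    n = len(judge_a)
    # Observed agreement: fraction of episodes where the verdicts coincide.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement from each judge's marginal pass rate.
    p_a = sum(judge_a) / n
    p_b = sum(judge_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both judges are constant and identical
    return (observed - expected) / (1 - expected)


# Toy verdicts (made up): the judges disagree on one of eight episodes.
judge_a = [1, 1, 1, 0, 0, 0, 1, 0]
judge_b = [1, 1, 0, 0, 0, 0, 1, 0]

print(round(cohens_kappa(judge_a, judge_b), 3))  # 0.75
```

In the toy data the raw agreement is 87.5% while κ is 0.75, which shows why the abstract reports both: κ discounts the agreement two judges would reach by chance given their pass rates.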
