When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

📄 arXiv: 2606.09662v1 📥 PDF

作者: Sai Adith Senthil Kumar

分类: cs.CL

发布日期: 2026-06-08

备注: 16 pages, 7 figures, 15 tables


💡 一句话要点

研究思维模式对指令遵循的影响及其局限性

🎯 匹配领域: 支柱九:具身大模型 (Embodied Foundation Models)

关键词: 大型推理模型 指令遵循 思维模式 错误模式 约束类型 性能评估 人机交互

📋 核心要点

  1. 现有大型推理模型在指令遵循任务中的表现不稳定,尤其在错误模式上存在显著差异。
  2. 论文通过对Qwen3模型的思维开关控制,探讨思维对指令遵循的影响,提出了约束类型的分类方法。
  3. 实验结果显示,思维模式对规划类任务有积极影响,而对精确类任务则表现不佳,且不同模型间的表现存在差异。

📝 摘要(中文)

大型推理模型(LRMs)在数学和编码性能上通常表现良好,但其对指令遵循的影响尚不明确。本文研究了使用Qwen3模型(1.7B-32B)进行的IFEval,采用相同权重的思维开关控制。结果显示,整体通过率变化较小(-0.55至-3.52个百分点),但有10-20%的提示在不同模式间切换通过与失败,表明思维改变了错误模式。通过后验分析,约束类型分为规划和精确,前者在思维下表现改善,而后者则持续恶化。思维还改变了最终答案的长度,匹配长度分析显著减少了精确度的下降,但仍存在残余惩罚。

🔬 方法详解

问题定义:本文旨在探讨大型推理模型在指令遵循任务中的表现,尤其是思维模式对错误类型的影响。现有方法未能充分解释思维对指令遵循的复杂影响,导致性能不稳定。

核心思路:通过使用Qwen3模型的思维开关控制,分析不同约束类型(规划与精确)在思维模式下的表现差异,揭示思维对错误模式的影响。

技术框架:研究采用了IFEval评估框架,结合Qwen3模型和Hunyuan模型,进行多种实验以比较思维开关对指令遵循的影响。主要模块包括思维开关控制、约束类型分类和最终答案长度分析。

关键创新:提出了约束类型的分类方法,明确区分规划和精确任务的表现,揭示思维对不同类型任务的影响机制,这是对现有研究的重要补充。

关键设计:在实验中,采用了相同权重的思维开关控制,分析了不同模型(1.7B-32B)在思维模式下的表现,并使用交叉编码器相关性度量分析思维轨迹。

🖼️ 关键图片

fig_0
fig_1
fig_2

📊 实验亮点

实验结果显示,思维模式对规划类任务的通过率有显著提升,而精确类任务则表现不佳,整体通过率变化在-0.55至-3.52个百分点之间。值得注意的是,10-20%的提示在不同模式间切换通过与失败,显示出思维对错误模式的复杂影响。

🎯 应用场景

该研究的潜在应用领域包括教育技术、智能助手和自动化系统,能够帮助优化指令遵循的算法设计,提高人机交互的效率和准确性。未来可能对大型推理模型的训练和应用提供新的视角,推动智能系统的进一步发展。

📄 摘要(原文)

Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors--some prompts improve while others worsen--rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r approximately 0.15); Planning shows near-zero predictive correlation (r approximately 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r approximately -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).