When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Author: Sai Adith Senthil Kumar

Category: cs.CL

Published: 2026-06-08

Comments: 16 pages, 7 figures, 15 tables

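💡 One-Sentence Takeaway

Studies how built-in thinking mode affects instruction following, and where it falls short.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large reasoning models, instruction following, thinking mode, error patterns, constraint types, performance evaluation, human-computer interaction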

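📋 Key Points

Large reasoning models behave inconsistently on instruction-following tasks, and the inconsistency shows up mainly in which errors they make.
Using the Thinking ON/OFF switch of Qwen3 models, the paper examines how thinking affects instruction following and proposes a grouping of constraint types.
Experiments show that thinking helps Planning-type constraints and hurts Precision-type constraints, with differences in aggregate behavior across model families.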

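📝 Abstract (translated from Chinese)

Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. This paper studies IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls. Aggregate pass-rate changes are small (-0.55 to -3.52 percentage points), yet 10-20% of prompts switch between pass and fail across modes, indicating that thinking changes the pattern of errors. Under a post-hoc grouping, constraint types separate into Planning and Precision: the former improves under thinking, while the latter consistently worsens. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains.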
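The matched-length idea can be illustrated with a small sketch. The summary does not say how the paper matches lengths, so the rule used here (keep only prompts whose Thinking ON answer length is within a relative tolerance of the Thinking OFF length) is an assumption, as are the function name `matched_length_delta`, the `tolerance` parameter, and the toy data.

```python
def matched_length_delta(pairs, tolerance=0.2):
    """Pass-rate change (ON minus OFF, in percentage points) on length-matched prompts.

    pairs: iterable of (len_off, passed_off, len_on, passed_on), one tuple per prompt.
    A prompt is kept when |len_on - len_off| <= tolerance * len_off.
    NOTE: this matching rule is a hypothetical stand-in, not the paper's procedure.
    Returns (delta_pp, number_of_prompts_kept).
    """
    kept = [
        (passed_off, passed_on)
        for len_off, passed_off, len_on, passed_on in pairs
        if len_off > 0 and abs(len_on - len_off) <= tolerance * len_off
    ]
    if not kept:
        raise ValueError("no prompts satisfy the length-matching rule")
    n = len(kept)
    passes_off = sum(1 for p_off, _ in kept if p_off)
    passes_on = sum(1 for _, p_on in kept if p_on)
    return 100.0 * (passes_on - passes_off) / n, n


# Toy data, invented for illustration only.
toy_pairs = [
    (100, True, 110, True),
    (100, True, 300, False),  # ON answer is much longer and fails
    (50, True, 55, False),
    (80, False, 90, True),
]
matched = matched_length_delta(toy_pairs)                    # length-matched subset
unmatched = matched_length_delta(toy_pairs, tolerance=10.0)  # effectively keeps every prompt
```

Comparing `matched` with `unmatched` shows the kind of contrast the analysis relies on: if a drop shrinks once lengths are matched, part of it is attributable to length rather than to thinking itself.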

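🔬 Method Details

Problem definition: The paper examines how large reasoning models perform on instruction-following tasks, in particular how thinking mode affects the types of errors they make. Existing work does not adequately explain thinking's mixed effect on instruction following, which leaves performance hard to predict.

Core idea: Using the Thinking ON/OFF switch of Qwen3 models, analyze how different constraint types (Planning vs. Precision) behave under thinking, revealing how thinking shifts error patterns.

Technical framework: The study uses the IFEval benchmark with Qwen3 and Hunyuan models and runs several experiments comparing instruction following with thinking on and off. The main components are the thinking on/off control, the constraint-type grouping, and the final-answer length analysis.

Key innovation: A grouping of constraint types that separates Planning from Precision and shows that thinking affects the two classes differently, complementing existing research.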
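A class-level comparison of this kind can be sketched as follows. The constraint-type names and their assignment to Planning or Precision in `CONSTRAINT_CLASS` are invented placeholders, since the paper's actual grouping is not reproduced in this summary; unlisted types fall back to a Neutral class, which the original abstract also mentions.

```python
from collections import defaultdict

# Hypothetical mapping for illustration only; not the paper's grouping.
CONSTRAINT_CLASS = {
    "number_paragraphs": "Planning",
    "number_words": "Planning",
    "lowercase_only": "Precision",
    "end_with_phrase": "Precision",
}


def class_level_delta(records):
    """Pass-rate change (ON minus OFF, in percentage points) per constraint class.

    records: iterable of (constraint_type, passed_off, passed_on).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # class -> [count, passes_off, passes_on]
    for constraint_type, passed_off, passed_on in records:
        cls = CONSTRAINT_CLASS.get(constraint_type, "Neutral")
        totals[cls][0] += 1
        totals[cls][1] += int(passed_off)
        totals[cls][2] += int(passed_on)
    return {
        cls: 100.0 * (passes_on - passes_off) / count
        for cls, (count, passes_off, passes_on) in totals.items()
    }


# Toy records, invented for illustration only.
toy_records = [
    ("number_paragraphs", False, True),
    ("number_words", True, True),
    ("lowercase_only", True, False),
    ("end_with_phrase", True, True),
    ("include_keyword", True, True),  # unlisted type, counted as Neutral
]
deltas = class_level_delta(toy_records)
```

On the toy records the Planning class gains and the Precision class loses, mirroring the sign pattern the paper reports, though the magnitudes here carry no meaning.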

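Key design: The experiments use same-weights Thinking ON/OFF controls across Qwen3 model sizes (1.7B-32B) and analyze thinking traces with a cross-encoder relevance metric.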
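The relevance-compliance link reported for the trace analysis is a correlation between a per-instance relevance score and a binary pass/fail outcome. A minimal sketch, assuming a plain Pearson coefficient (equivalent to the point-biserial correlation when one variable is binary); the summary does not state which coefficient the paper uses, and the cross-encoder scoring step itself is not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data, invented for illustration: relevance scores and pass (1) / fail (0).
toy_relevance = [0.2, 0.4, 0.6, 0.8]
toy_passed = [0, 0, 1, 1]
r = pearson_r(toy_relevance, toy_passed)
```

A value near zero, as reported for Planning, would mean that trace relevance does not predict final-answer compliance, even when traces engage with the constraint.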

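📊 Experimental Highlights

Thinking improves pass rates on Planning-type constraints at the class level and lowers them on Precision-type constraints, while the aggregate pass-rate change stays between -0.55 and -3.52 percentage points. Notably, 10-20% of prompts switch between pass and fail across modes, which shows that thinking changes which errors occur rather than uniformly degrading performance.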
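A small aggregate change can coexist with a large flip rate because prompts that start passing offset prompts that start failing. A minimal sketch of both quantities, using invented toy outcomes and a hypothetical helper name `pass_rate_delta_and_flips`:

```python
from typing import Sequence


def pass_rate_delta_and_flips(off: Sequence[bool], on: Sequence[bool]) -> dict:
    """Compare per-prompt pass/fail outcomes for Thinking OFF vs. Thinking ON."""
    if len(off) != len(on) or len(off) == 0:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(off)
    gained = sum(1 for a, b in zip(off, on) if not a and b)  # fail -> pass
    lost = sum(1 for a, b in zip(off, on) if a and not b)    # pass -> fail
    return {
        "delta_pp": 100.0 * (gained - lost) / n,       # net pass-rate change
        "flip_rate_pct": 100.0 * (gained + lost) / n,  # prompts that changed outcome
        "gained": gained,
        "lost": lost,
    }


# Toy outcomes for ten prompts, invented for illustration only.
toy_off = [True, True, True, False, False, True, True, True, False, True]
toy_on = [True, False, True, True, False, True, False, True, False, True]
result = pass_rate_delta_and_flips(toy_off, toy_on)
```

The net change depends only on `gained - lost`, while the flip rate depends on `gained + lost`, which is why the two can diverge.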

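🎯 Application Scenarios

Potential application areas include educational technology, intelligent assistants, and automated systems, where the findings can help optimize instruction-following algorithm design and improve the efficiency and accuracy of human-computer interaction. The work may also offer a new perspective on training and deploying large reasoning models.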

📄 Abstract (original)

Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors--some prompts improve while others worsen--rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r approximately 0.15); Planning shows near-zero predictive correlation (r approximately 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r approximately -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).
