Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

📄 arXiv: 2606.06454v1

Author: Mehmet Iscan

Categories: cs.SE, cs.CL

Published: 2026-06-04

Comments: 34 pages, 5 figures, 8 tables


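💡 One-Sentence Summary

Presents a pre-registered two-tier ablation study that tests whether a Popperian code-generation prompt skill owes its gains to its content or to the structure of the scaffold.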

🎯 Matched Area: Pillar 9: Embodied Foundation Models

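Keywords: code generation, Popperian falsificationism, ablation study, large language models, model bias, structural scaffold, skill evaluation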

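📋 Key Points

  1. Reported gains from prompt "skills" for code generation are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases, so the true effect of a skill is hard to establish.
  2. The paper pre-registers a two-tier ablation with three controls (a length-matched placebo, a labels-only scaffold, and an execution oracle) to separate the effect of the skill's Popperian content from that of the structure any scaffold imposes.
  3. On the frontier model, the pre-registered +5-point improvement is not supported. On the small model, the structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over the labels-only scaffold.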

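📝 Abstract (Summary)

Large language models increasingly write, review, and judge code, and a growing practice equips them with prompt "skills" that ask the model to reason like a scientist. The paper asks whether the apparent gains from such a skill come from its Popperian content or merely from the structure that any scaffold imposes. Using a pre-registered two-tier ablation, the author compares code-generation correctness across conditions. On the frontier model, the pre-registered +5-point improvement is not supported. On the small model, the structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over the labels-only scaffold. The paper contributes a calibrated negative result and a reusable disambiguation protocol, which together bound an engineering claim about this one family of prompt skills.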

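🔬 Method Details

Problem definition: The paper examines whether a Popperian prompt skill for code generation genuinely improves the generated code. Existing evidence relies mostly on LLM-as-a-judge scores, whose positional, self-preference, and stylistic biases make the reported gains unreliable.

Core idea: A pre-registered two-tier ablation compares code-generation correctness across conditions in order to separate the contribution of the Popperian content from that of the scaffold structure.

Technical framework: The experiment uses three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests). It adds a vocabulary-halo sentinel and a same-model self-judge audit.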
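
As a rough illustration of how the prompt arms in such an ablation could be assembled, the sketch below builds a labels-only scaffold and a length-matched placebo from a skill text. The skill wording, the header convention (a line ending in a colon), the filler sentence, and the unprompted `baseline` arm are all assumptions made for this example; the paper's actual prompts are not reproduced in this summary.

```python
import re

# Hypothetical skill text; the real prompt wording is not given in the source.
FULL_SKILL = """\
Conjecture:
State the hypothesis your code embodies.
Refutation attempt:
List inputs that could break it and check each one.
Verdict:
Keep the code only if no attempt refutes it.
"""

# Assumed convention: a header is a capitalized line that ends with a colon.
HEADER = re.compile(r"^[A-Z][\w ]*:$")


def labels_only(skill: str) -> str:
    """Keep the header lines and drop the procedural text beneath them."""
    kept = [line for line in skill.splitlines() if HEADER.match(line)]
    return "\n".join(kept) + "\n"


def length_matched_placebo(
    skill: str, filler: str = "Please write clear and correct code."
) -> str:
    """Repeat neutral filler until it has as many whitespace tokens as the skill."""
    target = len(skill.split())
    words = filler.split()
    return " ".join(words[i % len(words)] for i in range(target)) + "\n"


def build_arms(task_prompt: str, skill: str = FULL_SKILL) -> dict[str, str]:
    """Return one prompt per arm; 'baseline' (no prefix) is an assumed extra arm."""
    return {
        "baseline": task_prompt,
        "placebo": length_matched_placebo(skill) + "\n" + task_prompt,
        "labels_only": labels_only(skill) + "\n" + task_prompt,
        "full_skill": skill + "\n" + task_prompt,
    }
```

Matching on whitespace tokens is only one possible notion of "length-matched"; a tokenizer-based count would be a reasonable alternative if the goal is to equalize model-side context length.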

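Key innovation: The ablation explicitly disentangles the effect of the Popperian content from that of the scaffold structure, and it yields a calibrated negative result that bounds existing engineering claims about this prompt-skill family.

Key design: Two models are tested, Claude Sonnet 4.6 (N=163) and Qwen2.5-Coder-0.5B (N=164). The correctness of the generated code is compared across conditions, with correctness determined by the HumanEval+ unit tests.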

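📊 Experimental Highlights

On the frontier model Claude Sonnet 4.6 (N=163), all conditions sit near the benchmark ceiling and do not separate, so the pre-registered +5-point improvement is not supported (a ceiling-limited non-detection). On the small model Qwen2.5-Coder-0.5B (N=164), the structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over the labels-only scaffold, and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index.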
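
To make the best-of-eight comparison concrete, the following sketch computes a best-of-k score per arm and the gap between arms in percentage points. It assumes that "best-of-eight correctness" means a task counts as solved when at least one of its eight samples passes the unit tests; the arm names and the toy pass/fail data are invented for illustration and are not the paper's results.

```python
from typing import Dict, List


def best_of_k(results: List[List[bool]], k: int = 8) -> float:
    """Fraction of tasks where at least one of the first k samples passes."""
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(1 for samples in results if any(samples[:k]))
    return solved / len(results)


def point_gap(a: float, b: float) -> float:
    """Difference between two rates, expressed in percentage points."""
    return 100.0 * (a - b)


# Toy data: 4 tasks x 8 samples per arm (made up, not from the paper).
toy: Dict[str, List[List[bool]]] = {
    "baseline": [[False] * 8, [False] * 8, [True] + [False] * 7, [False] * 8],
    "labels_only": [[False] * 8, [False, True] + [False] * 6, [True] * 8, [False] * 8],
    "full_skill": [[False] * 8, [True] + [False] * 7, [False] * 7 + [True], [False] * 8],
}

scores = {arm: best_of_k(r) for arm, r in toy.items()}
content_gap = point_gap(scores["full_skill"], scores["labels_only"])
scaffold_gap = point_gap(scores["labels_only"], scores["baseline"])
```

In this toy case the scaffold gap is positive while the content gap is zero, which is the qualitative pattern the summary describes for the small model: structure helps, but the procedural content adds nothing separable on top of the labels.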

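🎯 Application Scenarios

Potential applications include the development of code-generation tools, programming education, and prompt design for large language models. By clarifying where a prompt skill's gains actually come from, the study can inform the design of more effective code-generation systems and help improve the quality and efficiency of software development.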

📄 Abstract (Original)

Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases. We ask: if it appears to help, is the gain from the skill's Popperian content, or from the structure any scaffold imposes? We pre-register a two-tier ablation with three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests), plus a vocabulary-halo sentinel and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6, N=163) all conditions sit near the benchmark ceiling and do not separate, so the pre-registered +5-point improvement is not supported (a ceiling-limited non-detection). On a small model (Qwen2.5-Coder-0.5B, N=164) structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over a labels-only scaffold (aggregate F@8=L@8 vs V@8=34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index. In the two settings tested, the skill's Popperian procedural content adds no separable execution-correctness benefit beyond a labels-only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not an evaluation of Popperian methodology in general.
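
The self-judge audit mentioned in the abstract can be approximated with two simple statistics: whether the judge's picks pass the tests more often than a uniformly random pick would, and how concentrated the picks are on a single candidate index. The sketch below is one possible way to compute them; the function names and the toy data are assumptions for illustration, not the paper's audit code.

```python
from collections import Counter
from typing import List


def judge_accuracy(passes: List[List[bool]], picks: List[int]) -> float:
    """Share of tasks where the candidate picked by the judge passes the tests."""
    return sum(p[i] for p, i in zip(passes, picks)) / len(picks)


def random_baseline(passes: List[List[bool]]) -> float:
    """Expected accuracy when one candidate per task is picked uniformly at random."""
    return sum(sum(p) / len(p) for p in passes) / len(passes)


def top_index_share(picks: List[int]) -> float:
    """Fraction of picks that land on the single most frequently chosen index."""
    return Counter(picks).most_common(1)[0][1] / len(picks)


# Toy data: 5 tasks x 4 candidates each (made up, not from the paper).
passes = [
    [False, True, False, False],
    [True, True, False, False],
    [False, False, False, True],
    [False, False, True, True],
    [False, False, False, False],
]
picks = [0, 0, 0, 2, 1]  # candidate index chosen by the judge for each task

acc = judge_accuracy(passes, picks)
rand = random_baseline(passes)
share = top_index_share(picks)
```

A judge whose accuracy is close to the random baseline while its top-index share is high is behaving more like a positional heuristic than a quality filter. Deciding whether a given gap is meaningful would still need a significance test, which this sketch omits.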