StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

Authors: Hao Wang, Rui Li, Lei Sha, Jie M. Zhang

Categories: cs.SE, cs.CL

Published: 2026-05-12

💡 One-Sentence Takeaway

StepCodeReasoner: aligning code reasoning with stepwise execution traces through reinforcement learning

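🎯 Matched area: Pillar 2, RL Algorithms and Architecture (RL & Architecture)

Keywords: code reasoning, reinforcement learning, execution traces, code generation, program understanding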

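📋 Key Points

Existing code reasoning methods do not supervise intermediate execution states, which invites reward hacking and inconsistent reasoning.
StepCodeReasoner inserts execution-trace anchors into code, turning code reasoning into a verifiable, stepwise execution modeling problem.
The proposed Bi-Level GRPO reinforcement learning algorithm performs structured credit assignment at two levels, inter-trajectory and intra-trajectory, which improves model performance.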

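📝 Abstract (Translated from Chinese)

Existing code reasoning methods mainly supervise the final code output and ignore intermediate states, which often leads to reward hacking: a correct answer reached through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit supervision of intermediate execution states. Structured, print-based execution-trace anchors are inserted into the code automatically, and the model is trained to predict the runtime state at each step, which turns code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment that works at two levels: inter-trajectory, where alternative execution paths are compared, and intra-trajectory, where intermediate accuracy is rewarded according to its impact on downstream correctness. Extensive experiments show that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model reaches 91.1% on CRUXEval and 86.5% on LiveCodeBench, surpassing the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). On the execution-trace benchmark REval, our model scores 82.9%, surpassing the CodeReasoner-7B baseline (72.3%), its 14B version (81.1%), and GPT-4o (77.3%). Our approach also improves code generation performance, which indicates that explicit execution modeling strengthens both code reasoning and code generation.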

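🔬 Method Details

Problem definition: Existing code reasoning methods rely mainly on the final output for supervision and ignore the intermediate states of code execution. The weakness of this setup is that a model can reach the correct final result through an incorrect reasoning process, which is "reward hacking". The reasoning process then becomes unreliable, and generalization is hard to guarantee.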

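Core idea: StepCodeReasoner decomposes code reasoning into a sequence of verifiable steps and explicitly supervises the execution state at each step, so that the reasoning process stays correct and consistent. Recasting code reasoning as stepwise execution modeling makes better use of intermediate-state information and helps avoid reward hacking.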
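To make the contrast between final-only and stepwise supervision concrete, here is a minimal sketch. The function names, the `State` type, and the exact-match scoring rule are illustrative assumptions, not the paper's reward definition.

```python
from typing import Any

# A runtime state is modeled here as a mapping from variable name to value.
# This representation is an assumption made for illustration only.
State = dict[str, Any]


def final_only_reward(predicted: list[State], actual: list[State]) -> float:
    """Return 1.0 if the last predicted state equals the last actual state."""
    if not predicted or not actual:
        return 0.0
    return 1.0 if predicted[-1] == actual[-1] else 0.0


def stepwise_accuracy(predicted: list[State], actual: list[State]) -> float:
    """Return the fraction of actual steps whose state was predicted exactly."""
    if not actual:
        return 0.0
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


# Ground-truth states of a running sum over 1, 2, 3.
actual = [{"total": 1}, {"total": 3}, {"total": 6}]
# A prediction with wrong intermediate states but the right final answer.
hacked = [{"total": 2}, {"total": 4}, {"total": 6}]

final_score = final_only_reward(hacked, actual)   # full credit
step_score = stepwise_accuracy(hacked, actual)    # only one of three steps
```

In this toy case the final-only signal gives full credit to an inconsistent trace, while the stepwise signal exposes the two wrong intermediate states.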

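Technical framework: The StepCodeReasoner framework consists of the following steps: 1) Structured print statements are inserted into the code automatically to serve as execution-trace anchors. 2) During training, the model must predict the runtime state at each anchor. 3) Training uses the Bi-Level GRPO reinforcement learning algorithm, which compares different execution paths at the inter-trajectory level and, at the intra-trajectory level, rewards intermediate-state accuracy according to its impact on downstream correctness.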
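Step 1 can be sketched with Python's standard `ast` module. This is only an illustration under stated assumptions: the `AnchorInserter` class, the JSON anchor format, and the rule "anchor after every simple assignment" are hypothetical choices, since the summary does not specify how the paper places or formats its anchors.

```python
import ast
import contextlib
import io
import json


class AnchorInserter(ast.NodeTransformer):
    """Add a structured print after each assignment to a plain variable name.

    Hypothetical placement rule and format, chosen for illustration.
    """

    def __init__(self) -> None:
        self.counter = 0

    def _anchor(self, name: str) -> ast.stmt:
        self.counter += 1
        src = (
            f'print(json.dumps({{"anchor": {self.counter}, '
            f'"var": {name!r}, "value": repr({name})}}))'
        )
        return ast.parse(src).body[0]

    def _instrument(self, node, target):
        self.generic_visit(node)
        if isinstance(target, ast.Name):
            return [node, self._anchor(target.id)]
        return node

    def visit_Assign(self, node: ast.Assign):
        return self._instrument(node, node.targets[0])

    def visit_AugAssign(self, node: ast.AugAssign):
        return self._instrument(node, node.target)


def insert_anchors(source: str) -> str:
    """Return the source with trace anchors inserted (needs Python 3.9+)."""
    tree = AnchorInserter().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def run_with_trace(source: str) -> list[dict]:
    """Run the instrumented program and collect the anchor records.

    Assumes the program itself prints nothing and is trusted code.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(insert_anchors(source), {"json": json})
    return [json.loads(line) for line in buffer.getvalue().splitlines()]


program = "total = 0\nfor i in range(3):\n    total += i\n"
trace = run_with_trace(program)
```

The records in `trace` are the ground-truth runtime states that a model would be asked to predict at each anchor in step 2.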

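Key innovation: StepCodeReasoner introduces explicit supervision of intermediate execution states, turning code reasoning into a verifiable, stepwise execution modeling problem. Unlike existing methods, it checks not only the final output but also whether each step of the reasoning is correct. In addition, the Bi-Level GRPO reinforcement learning algorithm assigns credit more effectively, which improves model performance.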

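Key design: The execution-trace anchors inserted into the code are structured and carry the values of key variables along with program-state information. Bi-Level GRPO is designed around both inter-trajectory and intra-trajectory dependencies: it compares different execution paths and evaluates the accuracy of intermediate states to assign credit more effectively. The exact loss function and network architecture are left to the paper and are not known from this summary.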
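Because the exact objective is not given here, the following is only a rough sketch of how two levels of credit could be combined. The inter-trajectory part uses the group-normalized advantage familiar from GRPO. The intra-trajectory rule (a correct step earns the fraction of later steps that are also correct) and the mixing weight `beta` are hypothetical stand-ins, not the paper's formulation.

```python
from statistics import mean, pstdev


def inter_trajectory_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-normalized advantage: (r_i - mean) / (std + eps) within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def intra_trajectory_credit(step_correct: list[bool]) -> list[float]:
    """Hypothetical per-step credit tied to downstream correctness.

    A wrong step earns 0. A correct step earns the fraction of later steps
    that are correct, or 1.0 if it is the last step.
    """
    credits = []
    for t, ok in enumerate(step_correct):
        downstream = step_correct[t + 1:]
        if not ok:
            credits.append(0.0)
        elif not downstream:
            credits.append(1.0)
        else:
            credits.append(sum(downstream) / len(downstream))
    return credits


def bi_level_advantages(
    group_rewards: list[float],
    group_steps: list[list[bool]],
    beta: float = 0.5,  # hypothetical mixing weight
) -> list[list[float]]:
    """Per-step advantage = trajectory-level advantage + beta * step credit."""
    inter = inter_trajectory_advantages(group_rewards)
    return [
        [a + beta * c for c in intra_trajectory_credit(steps)]
        for a, steps in zip(inter, group_steps)
    ]


# Two sampled trajectories for the same program: one fully correct,
# one that goes wrong after its first step.
rewards = [1.0, 0.0]
steps = [[True, True, True], [True, False, False]]
advantages = bi_level_advantages(rewards, steps)
```

Under this toy rule, a correct early step in a trajectory that later fails earns no extra credit, which is one simple way to tie intermediate accuracy to downstream correctness.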

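📊 Experimental Highlights

StepCodeReasoner achieves SOTA performance on code reasoning benchmarks including CRUXEval and LiveCodeBench. The 7B model reaches 91.1% on CRUXEval and 86.5% on LiveCodeBench, ahead of the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). On the execution-trace benchmark REval, StepCodeReasoner scores 82.9%, ahead of CodeReasoner-7B (72.3%), its 14B version (81.1%), and GPT-4o (77.3%).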
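The margins behind these comparisons can be computed directly from the scores reported in the abstract. The dictionary layout and the `margins` helper are illustrative, and the key `"StepCodeReasoner"` is simply a label for the authors' model.

```python
# Scores in percent, as reported in the abstract.
scores = {
    "CRUXEval": {
        "StepCodeReasoner": 91.1, "CodeReasoner-7B": 86.0, "GPT-4o": 85.6,
    },
    "LiveCodeBench": {
        "StepCodeReasoner": 86.5, "CodeReasoner-7B": 77.7, "GPT-4o": 75.1,
    },
    "REval": {
        "StepCodeReasoner": 82.9, "CodeReasoner-7B": 72.3,
        "CodeReasoner-14B": 81.1, "GPT-4o": 77.3,
    },
}


def margins(benchmark: str) -> dict[str, float]:
    """Percentage-point lead of StepCodeReasoner over each other model."""
    ours = scores[benchmark]["StepCodeReasoner"]
    return {
        model: round(ours - score, 1)
        for model, score in scores[benchmark].items()
        if model != "StepCodeReasoner"
    }


for name in scores:
    print(name, margins(name))
```

The smallest margin is on REval against the 14B CodeReasoner model, and the largest is on LiveCodeBench against GPT-4o.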

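🎯 Application Scenarios

StepCodeReasoner could be applied to improve automatic code generation, code debugging, and program understanding. By making code reasoning more reliable and interpretable, the method may help in building smarter and more dependable software systems. It could also be used in education to help students understand how code executes.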

📄 Abstract (Original)

Existing code reasoning methods primarily supervise final code outputs, ignoring intermediate states, often leading to reward hacking where correct answers are obtained through inconsistent reasoning. We propose StepCodeReasoner, a framework that introduces explicit intermediate execution-state supervision. By automatically inserting structured print-based execution-trace anchors into code, the model is trained to predict runtime states at each step, transforming code reasoning into a verifiable, stepwise execution modeling problem. Building on this execution-aware method, we introduce Bi-Level GRPO, a reinforcement learning algorithm for structured credit assignment at two levels: inter-trajectory, comparing alternative execution paths, and intra-trajectory, rewarding intermediate accuracy based on its impact on downstream correctness. Extensive experiments demonstrate that StepCodeReasoner achieves SOTA performance in code reasoning. In particular, our 7B model achieves 91.1% on CRUXEval and 86.5% on LiveCodeBench, outperforming the CodeReasoner-7B baseline (86.0% and 77.7%) and GPT-4o (85.6% and 75.1%). Furthermore, on the execution-trace benchmark REval, our model scores 82.9%, outperforming baseline CodeReasoner-7B (72.3%), its 14B counterpart (81.1%), and GPT-4o (77.3%). Additionally, our approach also improves code generation performance, demonstrating that explicit execution modeling enhances both code reasoning and code generation.
