Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Authors: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter

Categories: cs.CL, cs.AI, cs.LG, cs.RO

Published: 2023-12-07 (updated: 2024-07-29)

Notes: ICML 2024 Oral; Project webpage: https://chain-of-code.github.io

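💡 One-Sentence Summary

Proposes Chain of Code, which strengthens language model reasoning by having the model write code and emulate its execution, improving performance on tasks that involve semantic understanding.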

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: code reasoning, language models, semantic understanding, Chain of Thought, code emulation, LMulator

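📋 Key Points

Existing language models struggle to produce executable code for complex reasoning tasks that mix logic, arithmetic, and semantics, particularly when the semantic sub-tasks involve edge cases.
Chain of Code encourages the language model to write code and to selectively emulate the interpreter by generating the expected output of semantic sub-tasks, which strengthens its reasoning.
Experiments show that Chain of Code outperforms Chain of Thought and other methods on several benchmarks, reaching 84% on BIG-Bench Hard, a gain of 12% over Chain of Thought.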

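📝 Abstract (Translated Summary)

This paper proposes Chain of Code (CoC), a simple yet effective extension that improves code-driven reasoning in language models (LMs). The key idea is to encourage the LM to format the semantic sub-tasks of a program as flexible pseudocode, so that the interpreter can explicitly catch undefined behavior and hand it off to an LM to simulate (acting as an "LMulator"). Experiments show that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard it achieves 84%, a gain of 12% over Chain of Thought. In short, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".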

🔬 Method Details

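Problem definition: The paper addresses the difficulty language models have in producing executable code for complex reasoning tasks, especially tasks that mix semantic understanding with logic or arithmetic. Existing methods such as Chain of Thought guide the model through step-by-step reasoning, but sub-tasks that require rich semantic understanding remain hard to express as code that an interpreter can execute, particularly once edge cases are considered.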

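Core idea: The language model not only writes code but also selectively "emulates" the code interpreter. For semantic sub-tasks that are hard to implement as executable code, the model can write pseudocode and predict the expected output of that pseudocode. Combining code with simulated output lets the model handle semantic reasoning and fold it into the overall reasoning process.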
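To make the motivation concrete, here is a minimal sketch of what happens when a program containing a semantic helper is handed to an ordinary Python interpreter. The `detect_sarcasm` name comes from the paper's abstract; the sample sentences, the `program` string, and the `run_plain` helper are illustrative assumptions, not material from the paper.

```python
# Hypothetical LM-written program: the counting logic is ordinary Python,
# but detect_sarcasm is a semantic sub-task with no real implementation.
program = """
paragraphs = ["Oh great, another Monday.", "The meeting starts at 9am."]
count = 0
for p in paragraphs:
    if detect_sarcasm(p):
        count += 1
"""


def run_plain(source: str) -> str:
    """Run source with the ordinary interpreter and report what happened."""
    state = {}
    try:
        exec(source, state)
        return f"ok: count={state['count']}"
    except NameError as err:
        # The interpreter cannot resolve the undefined semantic helper.
        return f"failed: {err}"


result = run_plain(program)
print(result)
```

The run stops at the first call to `detect_sarcasm` with a `NameError`. That failure is the point at which Chain of Code hands the step to the language model instead of giving up.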

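Technical framework: The overall Chain of Code (CoC) pipeline can be summarized in the following steps:
1. The language model receives the reasoning task as input.
2. The language model generates a program that mixes code and pseudocode: standard code for the parts an interpreter can execute, and pseudocode for semantic sub-tasks that are hard to implement as executable code.
3. The code interpreter executes the code portions of the program.
4. When the interpreter reaches pseudocode, it passes the pseudocode and its context to the language model, which simulates the execution (the "LMulator").
5. The LMulator generates the expected output from the pseudocode and its context.
6. The interpreter takes the LMulator's output as the result of the pseudocode and continues executing the remaining code.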
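The steps above can be sketched as a small loop that tries each line in the real interpreter and falls back to a simulator when that fails. This is a simplified illustration under stated assumptions, not the authors' implementation: it handles only single-line statements, and `fake_lmulator`, `run_chain_of_code`, and `classify_fruit` are hypothetical names, with the stub standing in for an actual LM call.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]


def fake_lmulator(line: str, state: State) -> State:
    """Stand-in for an LM call (hypothetical): given a line and the current
    program state, return the variables predicted to hold after that line."""
    # A real system would prompt an LM here; this stub hard-codes one answer.
    if line.startswith("is_fruit ="):
        return {"is_fruit": state["item"] in {"apple", "banana"}}
    raise ValueError(f"stub cannot simulate: {line}")


def run_chain_of_code(
    lines: List[str], lmulator: Callable[[str, State], State]
) -> Tuple[State, List[str]]:
    state: State = {}
    trace: List[str] = []  # records which executor handled each line
    for line in lines:
        try:
            exec(line, {}, state)  # step 3: try the real interpreter first
            trace.append("python")
        except Exception:
            # steps 4-6: hand the line and state to the simulator, merge the result
            state.update(lmulator(line, state))
            trace.append("lmulator")
    return state, trace


program = [
    'item = "apple"',
    "is_fruit = classify_fruit(item)",  # undefined semantic helper
    "count = 1 if is_fruit else 0",
]
state, trace = run_chain_of_code(program, fake_lmulator)
print(state["count"], trace)  # → 1 ['python', 'lmulator', 'python']
```

Keeping the program state in one dictionary is what lets the two executors interleave: whichever one handles a line, the next line sees the same variables.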

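Key innovation: CoC introduces the "LMulator", a language model used to simulate the behavior of the code interpreter. This lets the system handle semantic sub-tasks that are hard to implement as executable code, which broadens the range of reasoning problems a language model can address. Compared with conventional Chain of Thought, CoC is more flexible because it combines real code execution with semantic simulation.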

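Key design: The main design questions are how to encourage the language model to write suitable pseudocode and how to have it simulate the interpreter accurately. This summary does not cover specific parameter settings, loss functions, or network architectures; such details likely depend on the particular language model and task. The central mechanism is prompting: suitable prompts guide the model to format semantic sub-tasks as pseudocode and to generate their expected output. How to combine code execution with semantic simulation effectively is a further design consideration.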
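As one possible reading of the prompting step, the sketch below serializes the current program state and the failing line into a query, then parses a JSON reply back into variable updates. The prompt wording, the JSON exchange format, and both function names are assumptions made for illustration; the summary above does not specify the paper's actual prompt.

```python
import json


def build_lmulator_prompt(line: str, state: dict) -> str:
    """Format one non-executable line plus the current variables as an LM query.
    The wording and layout are illustrative assumptions, not the paper's prompt."""
    return (
        "You are simulating a Python interpreter.\n"
        f"Current state: {json.dumps(state, sort_keys=True)}\n"
        f"Line to execute: {line}\n"
        "Reply with the updated variables as JSON."
    )


def parse_lmulator_reply(reply: str) -> dict:
    """Parse the LM's JSON reply into variable updates; reject non-object replies."""
    updates = json.loads(reply)
    if not isinstance(updates, dict):
        raise ValueError("expected a JSON object of variable updates")
    return updates


prompt = build_lmulator_prompt("is_fruit = classify_fruit(item)", {"item": "apple"})
updates = parse_lmulator_reply('{"is_fruit": true}')  # hand-written example reply
print(prompt)
print(updates)
```

A structured reply format of this kind keeps the simulated result machine-readable, so it can be merged into the interpreter's state without further parsing heuristics.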

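📊 Experimental Highlights

Chain of Code outperforms Chain of Thought and other baselines on several benchmarks. On BIG-Bench Hard it achieves 84%, a gain of 12% over Chain of Thought, which indicates that the approach can substantially improve language model performance on complex reasoning tasks.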

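🎯 Application Scenarios

Chain of Code could apply to tasks that require complex reasoning and semantic understanding, such as question answering, dialogue systems, and intelligent assistants. It may improve the accuracy and reliability of these systems when they handle complex questions.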

📄 Abstract (Original)

Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".
