Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions

作者: Dillon Plunkett, Adam Morris, Keerthi Reddy, Jorge Morales

分类: cs.CL

发布日期: 2025-05-21 (更新: 2025-11-10)

💡 一句话要点

自解释性：大语言模型能描述驱动决策的复杂内部过程

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 大语言模型 可解释性 自解释性 决策过程 微调训练

📋 核心要点

现有大语言模型的决策过程难以解释，神经网络的内部机制复杂，缺乏对其内部运作的深入理解。
该研究探索并提升LLM自我解释能力，通过训练使模型能够准确描述其决策过程中的定量特征。
实验表明，微调后的GPT-4o和GPT-4o-mini能准确报告决策偏好，且该能力可泛化到其他复杂决策。

📝 摘要（中文）

我们对大语言模型（LLMs）的响应方式及其原因的理解仍然有限。其神经网络的解释性一直具有挑战性，我们才刚刚开始梳理其中单个神经元和电路的功能。然而，理解这些系统的另一条途径是调查和发展它们解释自身功能的能力。本文表明：i) LLMs可以准确地描述其在某些决策过程中的内部过程的定量特征；ii) 可以通过训练来提高这些能力。为此，我们对GPT-4o和GPT-4o-mini进行了微调，使其能够在各种复杂环境中（例如，在公寓、贷款、假期等之间进行选择）根据随机生成的、关于如何权衡不同属性的定量偏好（例如，自然光与公寓安静环境的相对重要性）做出决策。我们证明了LLMs可以准确地报告这些偏好（即，它们在决策过程中学会赋予不同属性的权重）。接下来，我们证明了可以对这些LLMs进行微调，以更准确地解释它们的决策过程。最后，我们证明了这种训练具有泛化性：它提高了模型准确解释它们如何做出其他复杂决策的能力，而不仅仅是它们已被微调以做出的决策。这项工作是朝着训练LLMs准确而广泛地报告其自身内部过程迈出的一步——这种可能性将为可解释性、控制和安全性带来巨大的好处。

🔬 方法详解

问题定义：目前大型语言模型（LLMs）的决策过程是一个黑盒，我们很难理解模型为什么会做出特定的选择。现有的解释方法，如关注度机制可视化，往往只能提供有限的洞察，无法准确反映模型内部的推理过程。因此，如何让LLMs能够自我解释，揭示其决策背后的逻辑，是一个重要的研究问题。

核心思路：本文的核心思路是训练LLMs使其能够准确地报告其在决策过程中使用的内部权重和偏好。通过让模型显式地表达其对不同属性的重视程度，从而揭示其决策的依据。这种方法类似于让模型“说出”其思考过程，从而提高其透明度和可解释性。

技术框架：该研究的技术框架主要包括以下几个步骤：1) 构建一个包含多种复杂决策场景的数据集，例如选择公寓、贷款、假期等。每个场景都包含多个属性，并且每个属性都有一个随机生成的权重，代表其重要性。2) 使用该数据集对GPT-4o和GPT-4o-mini进行微调，使其能够根据属性权重做出决策。3) 训练模型解释其决策过程，即让模型报告其在决策中使用的属性权重。4) 评估模型报告的权重与真实权重之间的差异，从而衡量模型的自我解释能力。

关键创新：该研究的关键创新在于证明了LLMs不仅可以做出复杂的决策，还可以准确地描述其决策过程中的内部状态。此外，该研究还表明，通过微调可以显著提高LLMs的自我解释能力，并且这种能力可以泛化到其他决策场景。

关键设计：该研究的关键设计包括：1) 使用随机生成的属性权重来模拟真实的决策场景，从而避免了人为偏见。2) 使用GPT-4o和GPT-4o-mini作为基础模型，利用其强大的语言生成能力来解释决策过程。3) 设计合适的损失函数来鼓励模型准确地报告其内部权重。具体的损失函数细节在论文中未明确说明，属于未知信息。

🖼️ 关键图片

📊 实验亮点

实验结果表明，经过微调的GPT-4o和GPT-4o-mini能够准确地报告其在决策过程中使用的属性权重。更重要的是，这种自我解释能力可以泛化到其他复杂的决策场景，而不仅仅是模型被微调过的场景。这表明，通过训练，LLMs可以具备更广泛的自我解释能力。

🎯 应用场景

该研究成果可应用于需要高度透明和可解释性的领域，如金融、医疗、法律等。例如，在贷款审批中，模型可以解释其拒绝或批准贷款的原因，从而提高公平性和可信度。此外，该技术还可以用于调试和优化LLMs，提高其性能和安全性，并促进人机协作。

📄 摘要（原文）

We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to explain their own functioning. Here, we show that i) LLMs can accurately describe quantitative features of their own internal processes during certain kinds of decision-making and ii) that it is possible to improve these capabilities through training. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: It improves the ability of the models to accurately explain how they make other complex decisions, not just decisions they have been fine-tuned to make. This work is a step towards training LLMs to accurately and broadly report on their own internal processes -- a possibility that would yield substantial benefits for interpretability, control, and safety.

Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理