How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

Authors: Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Velickovic, Nathaniel Daw

Category: cs.LG

Published: 2026-04-24

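💡 One-Sentence Takeaway

Large language models detect and correct their own errors using internal confidence signals.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, self-correction, confidence signals, second-order models, error detection, PANL signal, causal intervention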

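📋 Key Points

Large language models (LLMs) can correct their own errors, but the underlying mechanism remains unclear and existing approaches struggle to explain it.
Building on second-order models of confidence, the study argues that LLMs detect and correct errors using an internal evaluative signal at the post-answer newline (PANL).
Experiments show that the PANL signal goes beyond conventional confidence measures: it predicts both error detection and error correction more accurately, and the finding replicates across multiple models and tasks.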

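📝 Abstract (Translated)

Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanism remains unclear. This paper studies the question through second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, which rules out error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing a basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at the token following the answer (the post-answer newline, PANL); this representation causally drives verbal confidence and dissociates from log-probabilities. The paper tests whether the PANL signal extends beyond confidence to support error detection and self-correction. Using a verify-then-correct paradigm, the results show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct, something no behavioural signal manages. Causal interventions confirm that PANL signals rescue error detection behaviour when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results indicate that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but also whether the model has the knowledge to fix it.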

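🔬 Method Details

Problem definition: How do large language models detect and correct their own errors without external feedback? Existing approaches, such as those based on token log-probabilities, cannot fully explain self-correction because they are first-order confidence models: confidence comes directly from the generation signal, so it is maximal for the chosen response and cannot flag that response as wrong.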
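The first-order limitation can be made concrete with a minimal sketch. It assumes greedy decoding over a single answer token and uses made-up logits; the function name `first_order_confidence` is hypothetical and not taken from the paper. Because the chosen response is the argmax of the same distribution that supplies the confidence, the confidence in the chosen response can never be lower than the confidence in any alternative.

```python
import numpy as np

def first_order_confidence(logits):
    """Greedy choice plus its softmax probability (a first-order readout).

    Illustrative assumption: one answer token, greedy decoding.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()              # numerically stable softmax
    probs = np.exp(shifted) / np.exp(shifted).sum()
    chosen = int(np.argmax(probs))               # the committed response
    return chosen, float(probs[chosen]), probs

# Made-up logits for three candidate answers.
chosen, confidence, probs = first_order_confidence([2.0, 1.5, -0.3])

# The committed response always holds the largest probability, so this
# signal alone cannot say "an alternative is more likely than my answer".
print(chosen, confidence >= probs.max())
```

A signal that should flag the committed answer as wrong therefore has to come from somewhere other than the probability that produced the choice.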

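Core idea: The paper borrows second-order models of confidence from decision neuroscience and argues that an LLM carries a partially independent evaluative signal that can be compared against the signal that generated the answer, which makes error detection possible. This evaluative signal appears in the activations at the post-answer newline (PANL). The PANL signal reflects not only confidence but also whether the model has the knowledge needed to correct the error.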
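A small simulation may help show why a partially independent evaluative signal enables error detection. This is a simplified toy in the spirit of second-order models from decision neuroscience, not the paper's analysis: the names `x_act` and `x_conf`, the noise level `sigma`, and the correlation `rho` are all illustrative assumptions, and the second-order confidence here conditions only on the evaluative sample rather than also on the action.

```python
import numpy as np

def simulate(n_trials=20000, sigma=1.0, rho=0.5, seed=0):
    """Toy first-order vs. second-order confidence (all values assumed)."""
    rng = np.random.default_rng(seed)
    d = rng.choice([-1.0, 1.0], size=n_trials)        # true state of the world
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    noise = rng.multivariate_normal([0.0, 0.0], cov, size=n_trials)
    x_act = d + noise[:, 0]      # sample that drives the response
    x_conf = d + noise[:, 1]     # partially independent evaluative sample
    action = np.where(x_act >= 0, 1.0, -1.0)
    correct = action == d

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # P(d == action | x) for equal priors and Gaussian noise is
    # sigmoid(2 * action * x / sigma^2).
    conf_first = sigmoid(2.0 * action * x_act / sigma**2)    # always >= 0.5
    conf_second = sigmoid(2.0 * action * x_conf / sigma**2)  # can fall below 0.5
    return correct, conf_first, conf_second

correct, conf_first, conf_second = simulate()
flag_first = conf_first < 0.5      # "I think I was wrong", first-order
flag_second = conf_second < 0.5    # same judgment, second-order

print("first-order flags:", int(flag_first.sum()))
print("second-order flag rate on errors:", flag_second[~correct].mean())
print("second-order flag rate on correct trials:", flag_second[correct].mean())
```

In this toy, the first-order readout never flags a response, while the second-order readout flags error trials more often than correct ones. Setting `rho` closer to 1 makes the two samples nearly identical and shrinks that advantage.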

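Technical framework: The paper uses a verify-then-correct paradigm. The model first generates an answer, then verifies whether that answer is correct, and attempts a correction if it detects an error. The researchers study the error detection and correction mechanism by observing and intervening on the PANL signal and by analyzing behavioural signals such as verbal confidence. The main stages are answer generation, error detection (based on the PANL signal and verbal confidence), and error correction.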
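The three stages can be sketched as a small control loop. This is a hedged illustration only: `verify_then_correct`, the `Trial` record, and the toy lookup tables are hypothetical stand-ins, and in a real experiment each callable would wrap a prompt to the same LLM rather than a dictionary lookup.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trial:
    question: str
    first_answer: str
    flagged_as_error: bool
    final_answer: str

def verify_then_correct(
    question: str,
    answer_fn: Callable[[str], str],
    verify_fn: Callable[[str, str], bool],
    correct_fn: Callable[[str, str], str],
) -> Trial:
    """Answer, verify the answer, and revise only if verification fails."""
    first = answer_fn(question)
    accepted = verify_fn(question, first)
    final = first if accepted else correct_fn(question, first)
    return Trial(question, first, not accepted, final)

# Toy stand-ins for the model (assumptions, not the paper's prompts).
FIRST_ANSWERS = {"Capital of France?": "Lyon", "Capital of Japan?": "Tokyo"}
KNOWN_FACTS = {"Capital of France?": "Paris", "Capital of Japan?": "Tokyo"}

def toy_answer(question: str) -> str:
    return FIRST_ANSWERS[question]

def toy_verify(question: str, answer: str) -> bool:
    return KNOWN_FACTS[question] == answer

def toy_correct(question: str, answer: str) -> str:
    return KNOWN_FACTS[question]

trials = [
    verify_then_correct(q, toy_answer, toy_verify, toy_correct)
    for q in FIRST_ANSWERS
]
for trial in trials:
    print(trial)
```

Recording the first answer, the verification verdict, and the final answer separately keeps error detection and error correction distinguishable in later analysis.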

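Key innovation: The main contribution is showing that the PANL signal plays a key role in LLM self-correction. Unlike a first-order confidence model, the PANL signal acts as a second-order evaluative mechanism: it assesses the correctness of an answer partially independently of the generation signal, and it predicts whether the model is able to correct the error.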

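Key design: The design has four parts: (1) evaluating error detection and correction on TriviaQA and MNLI; (2) using causal interventions, such as corrupting answer information, to test the role of the PANL signal in error detection; (3) comparing the PANL signal, verbal confidence, and token log-probabilities on how well each predicts error detection and correction; and (4) using LLMs of different sizes and architectures, Gemma 3 27B and Qwen 2.5 7B, to check that the results generalize.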
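One simple way to compare candidate signals on how well they predict a binary outcome such as "the model flagged its answer as wrong" is the area under the ROC curve. The sketch below is an assumption about how such a comparison could be set up; the summary does not state which metric or regression the paper uses, and a test of predictive power beyond another signal would need an incremental analysis rather than separate scores. All data are synthetic placeholders, so the printed numbers reflect only the arbitrary noise levels chosen here and say nothing about the paper's results.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a positive example outscores a negative one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]           # all positive/negative pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Synthetic placeholders: a binary outcome plus three noisy scores.
rng = np.random.default_rng(0)
n = 500
detected = rng.random(n) < 0.5                   # hypothetical outcome label
latent = detected.astype(float)
signals = {
    "token_logprob": latent + rng.normal(0.0, 3.0, n),      # arbitrary noise
    "verbal_confidence": latent + rng.normal(0.0, 1.5, n),  # arbitrary noise
    "panl_probe": latent + rng.normal(0.0, 0.7, n),         # arbitrary noise
}
results = {name: auroc(score, detected) for name, score in signals.items()}
for name, value in results.items():
    print(f"{name}: AUROC = {value:.3f}")
```

Scores are oriented so that higher values mean "more likely to flag the answer"; a confidence-like signal would be negated before scoring.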

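📊 Experimental Highlights

Verbal confidence predicts error detection far better than token log-probabilities, and PANL activations predict error detection beyond verbal confidence itself. More importantly, the PANL signal predicts which errors the model can correct, which none of the behavioural signals can do. Causal intervention experiments further confirm that PANL signals rescue error detection behaviour when answer information is corrupted. These findings hold across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI).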

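🎯 Application Scenarios

The findings could help improve the reliability and accuracy of large language models, especially where high-precision output matters, such as medical diagnosis, financial analysis, and legal advice. Understanding and using an LLM's internal confidence signals could support more effective self-correction mechanisms, reduce the spread of incorrect information, and increase trust in human-AI interaction. Future work could explore how to make better use of the PANL signal to build more reliable AI systems.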

📄 Abstract (Original)

Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing the basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at the token immediately following the answer (the post-answer newline, PANL) that causally drives verbal confidence and dissociates from log-probabilities. Here we test whether this PANL signal extends beyond confidence to support error detection and self-correction, deriving predictions from the second-order framework. Using a verify-then-correct paradigm, we show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct, where all behavioural signals fail. Causal interventions confirm that PANL signals rescue error detection behaviour when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results reveal that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but whether the model has the knowledge to fix it.
