Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know?

📄 arXiv: 2506.18183v3 📥 PDF

Authors: Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar

Categories: cs.AI, cs.CL

Published: 2025-06-22 (updated: 2025-07-18)


💡 One-Sentence Takeaway

Proposes introspective uncertainty quantification to improve the trustworthiness of reasoning models.

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: reasoning models, uncertainty quantification, model calibration, introspection, confidence estimation

📋 Key Points

  1. Reasoning models are often overconfident in their responses, especially when those responses are incorrect, which undermines trust.
  2. Introduces introspective uncertainty quantification (UQ), which improves calibration by having the model explicitly reason about its own chain-of-thought trace.
  3. Experiments show that introspection improves calibration for some reasoning models, while others become more poorly calibrated.

📝 Abstract (Summary)

Reasoning language models have achieved state-of-the-art results on many challenging benchmarks, yet they remain prone to generating confident, plausible-sounding responses that are wrong. Knowing when and how much to trust these models is critical for their safe deployment in real-world applications. This paper studies uncertainty quantification (UQ) for reasoning models: it introduces introspective UQ, evaluates how well reasoning models are calibrated, and finds that they are typically overconfident and that deeper reasoning leads to even higher confidence. Through introspection, some models become better calibrated, but not all. The paper closes with important research directions for designing the necessary UQ benchmarks and improving the calibration of reasoning models.

🔬 Method Details

Problem definition: The paper addresses the overconfidence of reasoning models when generating responses; existing methods do not effectively quantify these models' uncertainty, so confidently stated but incorrect answers cannot be recognized as untrustworthy.

Core idea: Introspective uncertainty quantification encourages the model to reflect on its own reasoning process when producing a response, with the goal of improving calibration. In this way, the model can better recognize the gap between its stated confidence and its actual correctness.

Technical framework: The overall pipeline consists of three stages: the reasoning model generates a response, an introspection step analyzes the chain-of-thought trace, and an uncertainty quantification step evaluates the resulting confidence. The model first produces an answer, then re-examines its own reasoning through introspection, and its stated confidence is finally assessed against actual correctness.
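To make the three-stage flow concrete, here is a minimal Python sketch of generation followed by introspection. The prompt wording, the `query_model` callable, and the `UQRecord` fields are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of the three-stage flow: (1) answer with a chain of thought and a
# self-verbalized confidence, (2) introspect on that trace, (3) keep both responses
# for later calibration analysis. `query_model(prompt) -> str` is a placeholder for
# any chat-completion call; the prompts are illustrative, not the paper's wording.

from dataclasses import dataclass
from typing import Callable

ANSWER_PROMPT = (
    "Answer the question below. Think step by step, then give your final answer and "
    "a confidence between 0 and 1 on the last line as 'Confidence: <value>'.\n\n"
    "Question: {question}"
)

INTROSPECTION_PROMPT = (
    "Below are a question and a previous response containing a reasoning trace, an "
    "answer, and a stated confidence.\n\nQuestion: {question}\nPrevious response:\n"
    "{first_response}\n\nCarefully re-examine the reasoning step by step, then restate "
    "the final answer and a revised confidence on the last line as 'Confidence: <value>'."
)

@dataclass
class UQRecord:
    question: str
    first_response: str          # answer + chain of thought + verbalized confidence
    introspective_response: str  # response after the model re-reads its own trace

def run_introspective_uq(query_model: Callable[[str], str], question: str) -> UQRecord:
    """Stage 1: generate answer and trace; Stage 2: introspect on that trace."""
    first = query_model(ANSWER_PROMPT.format(question=question))
    second = query_model(
        INTROSPECTION_PROMPT.format(question=question, first_response=first)
    )
    return UQRecord(question=question, first_response=first,
                    introspective_response=second)
```

The confidence stated in each response can then be parsed and compared against correctness, as in the sketches further below.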

Key innovation: The most important contribution is the introspection mechanism, which enables a reasoning model to reflect on its own reasoning process and thereby improve its calibration; this differs markedly from treating the model as a single-pass output generator.

Key design: Confidence is elicited as a self-verbalized estimate accompanying each answer, and introspection is carried out by prompting the model to re-read and critique its own chain-of-thought trace; the approach relies on prompting rather than additional training objectives or architectural changes, so it applies directly to existing reasoning models, including closed-source ones.
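Because confidence is self-verbalized in free text, evaluation needs a small parsing step. The helper below is a hypothetical utility matching the 'Confidence: <value>' convention assumed in the sketch above; the paper's actual answer format is not specified in this digest.

```python
# Assumed helper for extracting a self-verbalized confidence from model text.
# It looks for the last "Confidence: <value>" occurrence and accepts either a
# 0-1 value or a percentage; this format is an assumption, not the paper's protocol.

import re
from typing import Optional

_CONF_RE = re.compile(r"confidence\s*[:=]\s*([0-9]*\.?[0-9]+)\s*(%)?", re.IGNORECASE)

def parse_confidence(text: str) -> Optional[float]:
    """Return the last stated confidence mapped into [0, 1], or None if absent."""
    matches = _CONF_RE.findall(text)
    if not matches:
        return None
    value, percent = matches[-1]
    conf = float(value)
    if percent or conf > 1.0:   # treat "85" or "85%" as 85 percent
        conf /= 100.0
    return max(0.0, min(1.0, conf))

# Example: parse_confidence("Final answer: 42\nConfidence: 85%") -> 0.85
```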

📊 Experimental Highlights

In the experiments, self-verbalized confidence estimates are typically above 85%, particularly for incorrect responses. With introspection, some models such as o3-Mini and DeepSeek R1 become noticeably better calibrated, whereas Claude 3.7 Sonnet becomes more poorly calibrated. This indicates that introspection does not improve calibration uniformly across models.
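Calibration results like these are usually summarized with a metric such as the expected calibration error (ECE), which compares stated confidence with empirical accuracy inside confidence bins. The digest does not state the paper's exact metric choices, so the following is just the standard binned ECE for reference.

```python
# Standard binned expected calibration error (ECE): bin predictions by stated
# confidence and average |accuracy - confidence| weighted by bin size. Offered as
# the usual calibration metric; the paper's exact metrics are not given here.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    conf = np.asarray(confidences, dtype=float)   # self-verbalized confidences in [0, 1]
    acc = np.asarray(correct, dtype=float)        # 1.0 if the answer was correct, else 0.0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(acc[mask].mean() - conf[mask].mean())
    return ece

# Overconfidence shows up as confidence exceeding accuracy, e.g.:
# expected_calibration_error([0.9, 0.95, 0.85, 0.9], [1, 0, 0, 1]) ≈ 0.40
```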

🎯 Application Scenarios

Potential application areas include automated question answering, intelligent assistants, and decision-support systems. By making the confidence of reasoning models more trustworthy, these technologies can be deployed more safely in high-stakes domains such as healthcare, finance, and law, reducing the risk of erroneous decisions and strengthening user trust.

📄 Abstract (Original)

Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans' innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.