Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

Author: Nicolas Martorell

Category: cs.AI

Published: 2026-03-19

💡 One-Sentence Takeaway

Proposes numeric self-report as a way to track the internal states of language models.

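🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: language models, internal state tracking, numeric self-report, sentiment analysis, dialogue systems, interpretability, model optimization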

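📋 Key Points

Existing methods such as linear probes compress high-dimensional representations imperfectly and become harder to apply as model size grows.
The paper proposes tracking emotive states through the model's own numeric self-reports, inspired by the use of self-report in human psychology.
Experiments show that logit-based self-reports track internal states, that the effect scales with model size in some cases, and that it partially replicates in other model families.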

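📝 Abstract (Summary)

Tracking the internal states of large language models across a conversation matters for safety, interpretability, and model welfare, but existing methods are limited. Drawing on numeric self-report in human psychology, the paper asks whether a language model's own self-reports can track its emotive states. Studying four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, it finds that logit-based self-reports reflect interpretable internal states better than greedy-decoded ones, and that this capacity evolves as the conversation proceeds. The results suggest that numeric self-report is a viable tool for tracking internal emotive states in conversational AI systems.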

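🔬 Method Details

Problem definition: The paper addresses how to track the internal states of large language models across a conversation. Existing methods such as linear probes compress high-dimensional representations imperfectly and become harder to apply as model size grows.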

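Core idea: Borrowing numeric self-report from human psychology, the paper uses the model's own self-reports to track emotive states, and examines how they evolve over a conversation and how interpretable they are.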

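Technical framework: The overall setup comprises the definition of four concept pairs, the design of ten-turn conversations, the generation of model self-reports, and an analysis of their causal relationship with probe-defined internal states. The main modules are data collection, model training, and result analysis.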

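Key innovation: The main technical contribution is the logit-based self-report. Greedy-decoded self-reports collapse to a few uninformative values, whereas a self-report computed from the logits reflects the model's internal state more faithfully. The paper positions this as a complement to white-box methods such as linear probes, not a replacement for them.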
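The contrast between greedy and logit-based self-reports can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes one plausible reading: the model is asked for a rating on a 0 to 9 scale, and the logit-based report is the probability-weighted mean rating over the rating tokens. The function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_self_report(rating_logits, ratings):
    """Rating whose token has the highest logit (what greedy decoding would emit)."""
    best = max(range(len(ratings)), key=lambda i: rating_logits[i])
    return ratings[best]

def logit_self_report(rating_logits, ratings):
    """Assumed definition: expected rating under the softmax over rating tokens."""
    probs = softmax(rating_logits)
    return sum(p * r for p, r in zip(probs, ratings))

# Toy logits for the tokens "0".."9" at two conversation turns (made-up values).
ratings = list(range(10))
turn_a = [0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 3.0, 1.0, 0.0]
turn_b = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 2.5, 1.0]

for name, logits in [("turn_a", turn_a), ("turn_b", turn_b)]:
    print(name, greedy_self_report(logits, ratings),
          round(logit_self_report(logits, ratings), 3))
```

In this toy case both turns yield the same greedy rating, while the logit-based value still separates them, which is the kind of information the greedy readout discards.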

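Key design: The experiments vary settings to improve self-report generation, use a specific loss function to increase the model's sensitivity to emotive states, and apply activation steering to confirm that the coupling between self-report and probe-defined state is causal. The results also indicate that performance improves with model size in some cases.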
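As a rough illustration of what activation steering along a probe direction means, the sketch below shifts a hidden activation by a multiple of a unit-normalized probe direction and checks how the linear probe readout changes. This is a toy on plain Python lists, not the paper's setup: the vector values, the `alpha` scale, and all function names are assumptions, and a real experiment would apply the shift to a transformer's activations at some layer and then re-elicit the self-report.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def probe_score(hidden, direction, bias=0.0):
    """Linear probe readout: projection of the hidden state onto the probe direction."""
    return dot(hidden, direction) + bias

def steer(hidden, direction, alpha):
    """Add alpha times the unit probe direction to the hidden state."""
    unit = normalize(direction)
    return [h + alpha * d for h, d in zip(hidden, unit)]

# Hypothetical 4-dimensional hidden state and probe direction.
hidden = [0.2, -1.0, 0.5, 0.3]
direction = normalize([1.0, 0.0, 2.0, -1.0])

steered = steer(hidden, direction, alpha=1.5)
shift = probe_score(steered, direction) - probe_score(hidden, direction)
print(round(shift, 6))  # with a unit direction the readout moves by alpha
```

If the self-report moves in the same direction as the probe readout when the activation is steered, that supports a causal reading of the coupling; this sketch only shows the steering step itself.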

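📊 Experimental Highlights

Logit-based self-reports track probe-defined internal states, with Spearman ρ between 0.40 and 0.76 and isotonic R² between 0.12 and 0.54 in LLaMA-3.2-3B-Instruct. In some cases the effect scales with model size, approaching R² ≈ 0.93 in LLaMA-3.1-8B-Instruct, and it partially replicates in other model families.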
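For readers unfamiliar with the two reported metrics, the following standard-library sketch computes a Spearman rank correlation and an isotonic-regression R² between a self-report series and a probe score series. The data are made up, the helper names are hypothetical, and the choice to fit the probe score as a monotone function of the self-report is an assumption; the abstract does not state which variable is regressed on which.

```python
import math

def average_ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman rho: Pearson correlation of the rank-transformed series."""
    return pearson(average_ranks(a), average_ranks(b))

def isotonic_fit(x, y):
    """Non-decreasing least-squares fit of y on x via pool-adjacent-violators.
    Assumes distinct x values (reasonable for continuous logit-based reports)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block is [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted_sorted = []
    for s, c in blocks:
        fitted_sorted.extend([s / c] * c)
    fitted = [0.0] * len(x)
    for pos, i in enumerate(order):
        fitted[i] = fitted_sorted[pos]
    return fitted

def isotonic_r2(x, y):
    """R^2 of the isotonic fit: 1 - SS_res / SS_tot."""
    fitted = isotonic_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Toy data: logit-based self-reports and probe scores over six turns (made up).
self_report = [6.2, 6.7, 5.1, 7.4, 4.8, 6.9]
probe = [0.10, 0.35, -0.20, 0.30, -0.40, 0.80]
print(round(spearman(self_report, probe), 3),
      round(isotonic_r2(self_report, probe), 3))
```

Spearman ρ captures only the ordering, while the isotonic R² measures how much of the probe score's variance a monotone function of the self-report can explain, which is why the two numbers can differ substantially for the same pair of series.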

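🎯 Application Scenarios

Potential application areas include dialogue systems, sentiment analysis, and human-computer interaction. Tracking a model's internal states effectively can improve the safety and interpretability of dialogue systems and, in turn, the user experience. The approach may also prove useful in affective computing and intelligent assistants.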

📄 Abstract (Original)

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $\rho = 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($\Delta R^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
