Monitoring Monitorability

📄 arXiv: 2512.18311v1 📥 PDF

作者: Melody Y. Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y. Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, Bowen Baker

分类: cs.AI, cs.SE

发布日期: 2025-12-20


💡 一句话要点

提出监控可监控性以提升AI系统的安全性

🎯 匹配领域: 支柱二:RL算法与架构 (RL & Architecture) 支柱九:具身大模型 (Embodied Foundation Models)

关键词: 可监控性 思维链 AI安全 模型评估 强化学习 决策监控 系统评估

📋 核心要点

  1. 现有方法在不同训练程序和数据源下的可监控性脆弱,影响AI系统的安全性。
  2. 提出三种评估原型和新的可监控性指标,以系统性地评估和跟踪AI模型的可监控性。
  3. 实验结果显示,CoT监控比行为监控更有效,且长思维链通常具有更高的可监控性。

📝 摘要(中文)

现代AI系统的决策过程需要可观察性,以确保安全部署。本文提出了监控思维链(CoT)的方法,尽管其在不同训练程序和数据源下的可监控性可能脆弱。为此,作者提出了三种评估原型(干预、过程和结果属性)及新的可监控性指标,并引入了广泛的评估套件。研究表明,CoT监控在实际应用中比仅监控行为更有效,并且大多数前沿模型的可监控性较好但并不完美。长的思维链通常更具可监控性,而强化学习优化不会显著降低可监控性。

🔬 方法详解

问题定义:本文旨在解决现代AI系统在不同训练条件下的可监控性脆弱性问题,现有方法难以有效监控AI的决策过程。

核心思路:通过引入三种评估原型(干预、过程和结果属性)和新的可监控性指标,系统性地评估AI模型的思维链,提升监控效果。

技术框架:整体架构包括三个主要模块:评估原型、可监控性指标计算和评估套件,旨在全面评估模型的可监控性。

关键创新:提出的可监控性评估方法和指标是对现有监控方法的重大改进,能够更有效地捕捉模型的思维链。

关键设计:在设计中,关注思维链的长度、强化学习优化的影响以及监控者的计算能力,确保在不同条件下的可监控性评估。

🖼️ 关键图片

fig_0
fig_1
fig_2

📊 实验亮点

实验结果表明,CoT监控在实际应用中比仅监控行为更有效,且大多数前沿模型的可监控性较好,长思维链的可监控性显著高于短思维链。强化学习优化对可监控性影响不大,且在低推理努力的模型中,使用小模型但高推理努力可提高可监控性。

🎯 应用场景

该研究的潜在应用领域包括自动驾驶、医疗诊断和金融决策等高风险场景,能够提升AI系统的安全性和可靠性。通过有效的监控机制,能够及时发现和纠正AI系统的错误决策,降低潜在风险。

📄 摘要(原文)

Observability into the decision making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today's reasoning models has proven effective for detecting misbehavior. However, this "monitorability" may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability even at the current frontier scale. Notably, we find that for a model at a low reasoning effort, we could instead deploy a smaller model at a higher reasoning effort (thereby matching capabilities) and obtain a higher monitorability, albeit at a higher overall inference compute cost. We further investigate agent-monitor scaling trends and find that scaling a weak monitor's test-time compute when monitoring a strong agent increases monitorability. Giving the weak monitor access to CoT not only improves monitorability, but it steepens the monitor's test-time compute to monitorability scaling trend. Finally, we show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.