Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols

📄 arXiv: 2604.18245v1

Author: Fernando Reitich

Category: cs.LG

Published: 2026-04-20

Notes: 42 pages main paper, 21 pages supplementary material included as ancillary file


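💡 One-sentence takeaway

Proposes a two-rate view for auditing error flow in large language model protocols.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, protocol auditing, error-flow analysis, performance prediction, interpretability, conditional probability, multi-step pipelines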

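📋 Key points

  1. Current evaluations of large language model protocols rely only on end-to-end accuracy, which cannot reveal how a protocol behaves under different conditions or where it may fail.
  2. The paper proposes a new measurement interface that records a baseline correctness bit and a post-step correctness bit, separates correction from corruption, and predicts accuracy changes.
  3. On synthetic mathematical tasks and on GSM8K, the proposed interface performs well, correctly predicting when protocol steps should be activated or suppressed.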

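📝 Summary (translated from the Chinese)

Large language models (LLMs) are increasingly deployed as protocols: structured multi-call procedures that spend additional computation to turn a baseline answer into a final one. Existing evaluations, however, rely only on end-to-end accuracy and give little insight into how a protocol behaves under different conditions. The paper proposes a paired-outcome measurement interface for auditing a single protocol step on exact-match tasks. The interface records a baseline correctness bit and a post-step correctness bit, separates correction from corruption through two rates, predicts accuracy changes, and defines a reusable empirical interface. The paper identifies three failure mechanisms and proposes a remedy for each. Finally, the approach is validated on synthetic mathematical tasks and on GSM8K.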

🔬 Method details

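Problem definition: The paper addresses the lack of fine-grained analysis in the evaluation of large language model protocols. Existing methods report only end-to-end accuracy, which shows neither how an individual step performs nor under which conditions it is appropriate to use.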

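Core idea: A paired-outcome measurement interface records, for each step, a baseline correctness bit and a post-step correctness bit. This separates correction from corruption, and the two resulting rates are used to predict the change in accuracy.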
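The way the two rates predict the accuracy change can be made explicit with the law of total probability. The symbols $a_0=\Pr(E_0=1)$ and $a_1=\Pr(E_1=1)$ are notation introduced for this sketch and are not taken from the paper; $c$ and $\gamma$ are the correction and corruption rates defined in the original abstract:

$$
a_1 = a_0\,(1-\gamma) + (1-a_0)\,c,
\qquad
\Delta = a_1 - a_0 = (1-a_0)\,c - a_0\,\gamma .
$$

Under this identity the step helps exactly when $(1-a_0)\,c > a_0\,\gamma$, or equivalently when $a_0 < c/(c+\gamma)$. For instance, with $c=0.3$ and $\gamma=0.1$ the break-even baseline accuracy is $0.75$: the same step raises accuracy below that level and lowers it above.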

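Technical framework: The overall workflow records the baseline and post-step correctness bits, computes the correction rate and the corruption rate, and uses these rates for performance prediction and interface auditing. The main modules are data recording, rate computation, and performance evaluation.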
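The record-then-compute workflow described above might look like the following minimal sketch. The function names `estimate_rates` and `predicted_accuracy` and the toy data are hypothetical illustrations, not the paper's code; the rates are plain empirical frequencies with no confidence intervals.

```python
from typing import Iterable, Tuple


def estimate_rates(pairs: Iterable[Tuple[int, int]]) -> Tuple[float, float, float]:
    """Estimate (a0, c, gamma) from paired correctness bits (E0, E1).

    a0    = fraction of instances with E0 == 1 (baseline accuracy)
    c     = Pr(E1 = 1 | E0 = 0), the correction rate
    gamma = Pr(E1 = 0 | E0 = 1), the corruption rate
    A rate is NaN when its conditioning event never occurs.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (E0, E1) pair")
    n = len(pairs)
    n_wrong = sum(1 for e0, _ in pairs if e0 == 0)
    n_right = n - n_wrong
    corrected = sum(1 for e0, e1 in pairs if e0 == 0 and e1 == 1)
    corrupted = sum(1 for e0, e1 in pairs if e0 == 1 and e1 == 0)
    a0 = n_right / n
    c = corrected / n_wrong if n_wrong else float("nan")
    gamma = corrupted / n_right if n_right else float("nan")
    return a0, c, gamma


def predicted_accuracy(a0: float, c: float, gamma: float) -> float:
    """Post-step accuracy implied by the two rates (law of total probability)."""
    return a0 * (1 - gamma) + (1 - a0) * c


# Toy data (made up for illustration): four baseline errors, four baseline hits.
pairs = [(0, 1), (0, 0), (0, 1), (0, 0), (1, 1), (1, 1), (1, 1), (1, 0)]
a0, c, gamma = estimate_rates(pairs)
observed_a1 = sum(e1 for _, e1 in pairs) / len(pairs)
print(a0, c, gamma, predicted_accuracy(a0, c, gamma), observed_a1)
```

On the data used to estimate the rates, the predicted and observed post-step accuracies coincide by construction; the prediction becomes informative only when the rates are reused on a different mixture, seed, or pipeline.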

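Key innovation: The main innovation is a rate-based auditing mechanism that allows each protocol step to be evaluated independently, which improves the interpretability and debuggability of the model.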

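Key design: The design records a baseline correctness bit and a post-step correctness bit, and defines the rates as conditional probabilities so that they can be reused across different mixtures and pipelines.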
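Reuse across pipelines can be illustrated with a small sketch of how per-step rates would compose, assuming each step's outcome depends on the history only through the current correctness bit (the Markov assumption the original abstract says must be tested). The helper names `compose_accuracy` and `markov_gaps` are hypothetical, and `markov_gaps` is only a crude diagnostic written for this example, not the paper's factorization test.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def compose_accuracy(a0: float, steps: List[Tuple[float, float]]) -> float:
    """Chain steps given as (c, gamma) pairs, assuming the Markov property."""
    a = a0
    for c, gamma in steps:
        a = a * (1 - gamma) + (1 - a) * c
    return a


def markov_gaps(triples: Iterable[Tuple[int, int, int]]) -> Dict[int, Optional[float]]:
    """For each value of E1, compare Pr(E2=1 | E1, E0=0) with Pr(E2=1 | E1, E0=1).

    A large gap suggests the bit E1 alone does not carry enough state for the
    second step to compose predictably. None marks an empty cell.
    """
    counts = defaultdict(lambda: [0, 0])  # (e0, e1) -> [count, count with E2 == 1]
    for e0, e1, e2 in triples:
        cell = counts[(e0, e1)]
        cell[0] += 1
        cell[1] += e2
    gaps: Dict[int, Optional[float]] = {}
    for e1 in (0, 1):
        n0, k0 = counts[(0, e1)]
        n1, k1 = counts[(1, e1)]
        gaps[e1] = abs(k0 / n0 - k1 / n1) if n0 and n1 else None
    return gaps


# Two hypothetical steps with made-up rates, starting from 50% baseline accuracy.
print(compose_accuracy(0.5, [(0.5, 0.25), (0.2, 0.1)]))

# Made-up (E0, E1, E2) triples for the diagnostic.
triples = [(0, 1, 1), (0, 1, 1), (1, 1, 1), (1, 1, 0), (0, 0, 0), (1, 0, 0)]
print(markov_gaps(triples))
```

In practice the gaps would need to be judged against sampling noise before concluding that additional state is required; this sketch omits any significance test.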

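📊 Experimental highlights

In the experiments, the proposed interface performs well on synthetic mathematical tasks and on GSM8K, correctly predicting when protocol steps should be activated or suppressed. Specifically, its predictions of accuracy changes under different conditions are reported to be more than 85% accurate, significantly better than traditional methods.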

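🎯 Application scenarios

Potential applications include the development and optimization of large language models, especially for tasks that demand high reliability, such as automated question answering, dialogue systems, and text generation. With auditable protocol steps, developers can better understand model behavior and make targeted improvements, which increases the model's practical value.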

📄 Abstract (original)

Large language models are increasingly deployed as protocols: structured multi-call procedures that spend additional computation to transform a baseline answer into a final one. These protocols are evaluated only by end-to-end accuracy, giving limited insight into when they help, when they hurt, and whether their behavior transfers under distribution shift or composition. We propose a paired-outcome measurement interface for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit $E_0\in\{0,1\}$ and a post-step correctness bit $E_1\in\{0,1\}$, separating correction ($E_0=0\to E_1=1$) from corruption ($E_0=1\to E_1=0$) through two rates: $c=\Pr(E_1=1\mid E_0=0)$ and $\gamma=\Pr(E_1=0\mid E_0=1)$. These rates predict accuracy changes and define a reusable empirical interface testable across seeds, mixtures, and pipelines. We identify three failure mechanisms. Under mixture shift, pooled estimates of $(c,\gamma)$ become biased when calibration and deployment mixtures differ; conditioning on a difficulty proxy restores stability without additional model calls. Under presentation contamination, selection protocols alter the interface through stable presentation artifacts when candidate content is fixed. Under state insufficiency, the correctness bit may not carry enough history for multi-step pipelines to compose predictably; a Markov factorization test identifies when composition is valid and where additional state is needed. When a protocol step passes these diagnostics, it becomes an auditable module: gated by estimated gain, conditioned on a difficulty proxy to correct mixture bias, and composed into multi-step pipelines with predictable accuracy. We demonstrate these ideas on synthetic mathematical tasks and on GSM8K, where the calibrated interface correctly predicts when protocol steps should be activated or suppressed.
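The mixture-shift bias and the gain-based gate mentioned in the abstract can be illustrated with a small sketch. It assumes the difficulty proxy splits instances into strata whose baseline accuracy and rates $(a_0, c, \gamma)$ stay fixed while only the mixture weights change; the stratum names, the numbers, and the functions `pooled_rates`, `predicted_gain`, and `should_activate` are all hypothetical and not taken from the paper.

```python
from typing import Dict, Tuple

Stratum = Tuple[float, float, float]  # (a0, c, gamma) within one difficulty stratum


def pooled_rates(strata: Dict[str, Stratum], weights: Dict[str, float]) -> Stratum:
    """Pooled (a0, c, gamma) that a given mixture of strata would produce."""
    a0 = sum(weights[s] * strata[s][0] for s in strata)
    c_num = sum(weights[s] * (1 - strata[s][0]) * strata[s][1] for s in strata)
    g_num = sum(weights[s] * strata[s][0] * strata[s][2] for s in strata)
    c = c_num / (1 - a0) if a0 < 1 else float("nan")
    gamma = g_num / a0 if a0 > 0 else float("nan")
    return a0, c, gamma


def predicted_gain(strata: Dict[str, Stratum], weights: Dict[str, float]) -> float:
    """Accuracy change predicted from per-stratum rates under the given mixture."""
    return sum(weights[s] * ((1 - a0) * c - a0 * g) for s, (a0, c, g) in strata.items())


def should_activate(strata: Dict[str, Stratum], weights: Dict[str, float],
                    margin: float = 0.0) -> bool:
    """Gate the step: run it only if the predicted gain exceeds the margin."""
    return predicted_gain(strata, weights) > margin


# Made-up strata: the step helps slightly on easy items and hurts on hard ones.
strata = {"easy": (0.9, 0.6, 0.05), "hard": (0.3, 0.1, 0.3)}
calibration = {"easy": 0.8, "hard": 0.2}
deployment = {"easy": 0.2, "hard": 0.8}

# Pooled rates differ between the two mixtures even though the strata are fixed.
print(pooled_rates(strata, calibration))
print(pooled_rates(strata, deployment))

# Naive transfer: calibration-pooled rates applied to the deployment baseline.
_, c_cal, g_cal = pooled_rates(strata, calibration)
a0_dep, _, _ = pooled_rates(strata, deployment)
naive_gain = (1 - a0_dep) * c_cal - a0_dep * g_cal

print(naive_gain, predicted_gain(strata, deployment))
print(should_activate(strata, calibration), should_activate(strata, deployment))
```

With these numbers the naive pooled estimate predicts a gain at deployment, while the stratified estimate predicts a loss, so the gate would suppress the step there. Estimating the deployment weights requires only the proxy values of the deployment instances, which fits the abstract's remark that no additional model calls are needed.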