A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

作者: Ilan Doron-Arad, Idan Mehalel, Elchanan Mossel

分类: cs.LG

发布日期: 2026-05-07

💡 一句话要点

建立自回归思维链学习的在线学习理论，揭示思维链对降低错误界限的关键作用

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 在线学习 自回归模型 思维链 错误界限 计算学习理论 大语言模型

📋 核心要点

核心问题：现有研究主要关注静态PAC学习，缺乏对自回归生成过程中在线学习错误界限的理论刻画，特别是生成视界M对学习复杂度的影响。
方法要点：构建在线学习框架，对比端到端反馈与思维链反馈，通过分析不同反馈机制下错误界限随生成视界M的增长规律，量化思维链的价值。
实验或效果：证明了端到端模型存在不可避免的对数级错误界限，而思维链模型能有效消除对生成视界M的依赖，并给出了线性阈值类的最优界限。

📝 摘要（中文）

自回归生成是大型语言模型的核心机制，可视为下一标记生成器的重复应用。本文在Joshi等人（2025）提出的PAC学习模型基础上，开发了自回归生成的在线学习框架，重点研究未知生成器诱导的最终输出的错误界限（mistake bound）。研究区分了两种反馈模式：端到端模型（仅观察最终输出）和思维链模型（观察完整的M步生成轨迹）。研究结果表明，在线自回归学习的错误界限依赖于生成视界M。在端到端模型中，错误界限的增长率在常数到对数之间，且对数上限不可避免；而在思维链模型中，访问完整轨迹可完全消除对M的依赖。此外，本文还分析了自回归线性阈值类，证明了最优错误界限，并解决了先前研究中遗留的若干理论问题。

🔬 方法详解

问题定义：论文旨在解决自回归生成模型在在线学习场景下的可学习性问题。核心挑战在于：当学习者仅能观察到最终输出（端到端）或完整推理路径（思维链）时，如何量化生成步数M对预测错误界限的影响。

核心思路：通过引入在线学习的错误界限分析框架，对比两种反馈机制。核心直觉是思维链（CoT）通过提供中间状态的监督信息，能够有效缓解因长序列生成带来的误差累积，从而降低学习难度。

技术框架：框架基于在线学习的博弈论视角，学习者在每一轮中根据当前模型预测输出，随后观察反馈（最终标记或完整轨迹）并更新模型。通过分析不同假设空间（如线性阈值类）在上述反馈下的Mistake Bound，推导其随M的变化趋势。

关键创新：首次在在线学习框架下建立了自回归生成的理论分类学，证明了端到端模型中错误界限随M增长的对数上限是本质的，并从理论上量化了思维链在消除这种依赖方面的决定性作用。

关键设计：采用了基于Mistake Bound的分析方法，通过对生成轨迹的分解，将复杂的多步生成问题转化为可处理的序列决策问题，并针对线性阈值类假设空间推导了最优的界限常数。

📊 实验亮点

研究证明了在端到端反馈下，错误界限随生成视界M呈现对数级增长，且该上限在数学上是不可避免的。相比之下，思维链模型通过引入中间轨迹反馈，成功将错误界限对M的依赖降至零。此外，针对自回归线性阈值类，论文给出了最优错误界限，并推导了统计学习设置下的新下界，完善了该领域的理论版图。

🎯 应用场景

该研究为大语言模型的推理能力提供了坚实的理论基础，特别是在复杂逻辑推理、代码生成及长文本规划任务中。其结论支持了思维链（CoT）提示工程的有效性，为设计更高效的在线微调算法、提升模型在动态环境下的适应能力提供了理论指导，有助于优化长序列生成任务的训练策略。

📄 摘要（原文）

Autoregressive generation lies at the heart of the mechanism of large language models. It can be viewed as the repeated application of a next-token generator: starting from an input string (prompt), the generator is applied for $M$ steps, and the last generated token is taken as the final output. [Joshi et al., 2025] proposed a PAC model for studying the learnability of the input-output maps arising from this process. We develop an online analogue of this framework, focusing on the mistake bound of learning the final output induced by an unknown next-token generator. We distinguish between two forms of feedback. In the End-to-End model, after each round the learner observes only the final token produced after $M$ autoregressive steps. In the Chain-of-Thought model, the learner is additionally shown the entire $M$-step trajectory. Our goal is to understand how the optimal mistake bound depends on the generation horizon $M$, and to what extent observing intermediate tokens can reduce this dependence. Our main results show that the online theory of autoregressive learning exhibits a qualitative picture analogous to the statistical one found by [Hanneke et al., 2026], but with a different scale of dependence on the generation horizon. In the End-to-End model, we prove a taxonomy of possible mistake-bound growth rates in the generation horizon $M$: essentially any rate between constant and logarithmic can arise. We further show that this logarithmic ceiling is unavoidable. In the Chain-of-Thought model, we show that access to the full generated trajectory eliminates the dependence on $M$ altogether. We also analyze autoregressive linear threshold classes, and prove optimal mistake bounds, as well as a new lower bound for the statistical setting. Along the way, our results resolve several questions left open by [Joshi et al., 2025].

A Theory of Online Learning with Autoregressive Chain-of-Thought Reasoning

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理