Intrinsic Stability Limits of Autoregressive Reasoning: Structural Consequences for Long-Horizon Execution

Author: Hsien-Jyh Liao

Category: cs.AI

Published: 2026-02-06

Comments: 16 pages, 7 figures. Keywords: Autoregressive Reasoning, Long-Horizon Stability, Chain-of-Thought Reasoning, Information-Theoretic Analysis, Structured Reasoning, Inference Dynamics

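💡 One-Sentence Takeaway

Reveals an intrinsic stability limit of autoregressive reasoning and proposes structural governance for long-horizon execution.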

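🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: autoregressive reasoning, long-range dependencies, stability analysis, structural governance, language models, TextWorld, performance cliff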

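📋 Key Points

Existing approaches struggle to explain why LLM performance degrades on simple long-horizon tasks, and usually attribute the degradation to task complexity.
The paper argues that autoregressive reasoning has an intrinsic stability limit: decision advantage decays exponentially with execution length, so structural governance is required.
Experiments in synthetic environments and in TextWorld agree with the theoretical predictions, reveal performance cliffs, and underline the importance of structural governance.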

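🔬 Method Details

Problem definition: Large language models degrade on long-horizon reasoning tasks, even when the task is simple, linear, and unbranched. Existing work focuses on the complexity of the task itself, such as combinatorial explosion and long-term dependencies, and overlooks the intrinsic instability of the autoregressive generation process. The paper therefore studies the intrinsic stability limit of autoregressive reasoning and how that limit can be overcome.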

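Core idea: The paper treats long-horizon reasoning as a structural governance problem and identifies process-level instability in autoregressive generation as the key cause of performance degradation. Its theoretical analysis shows that decision advantage decays exponentially with execution length, so maintaining long-term coherence requires discrete segmentation and graph-like execution structures.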
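To make the case for discrete segmentation concrete, here is a minimal sketch under a toy model that is an assumption, not the paper's formulation: every step succeeds independently with probability `p_step`, a perfect verifier sits at each segment boundary, and a failed segment can be retried. The function names and parameters are hypothetical.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Success probability of one unsegmented chain of independent steps."""
    return p_step ** n_steps


def segmented_success(p_step: float, n_steps: int, seg_len: int, max_tries: int) -> float:
    """Success probability when the chain is cut into verified segments.

    Assumes (hypothetically) a perfect verifier at every segment boundary
    and up to max_tries independent attempts per segment.
    """
    if n_steps % seg_len != 0:
        raise ValueError("n_steps must be a multiple of seg_len in this sketch")
    p_segment_once = p_step ** seg_len
    # A segment fails only if every one of its attempts fails.
    p_segment = 1.0 - (1.0 - p_segment_once) ** max_tries
    return p_segment ** (n_steps // seg_len)


if __name__ == "__main__":
    print(chain_success(0.99, 200))
    print(segmented_success(0.99, 200, seg_len=20, max_tries=3))
```

With these illustrative numbers the unsegmented 200-step chain succeeds roughly 13% of the time, while ten verified 20-step segments with up to three attempts each succeed roughly 94% of the time. The gain comes entirely from the assumed verifier; without retries (`max_tries=1`) segmentation alone changes nothing in this model.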

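Technical framework: The paper supports its claims through theoretical derivation and experimental validation. First, Theorem A shows that decision advantage in single-path autoregressive reasoning decays exponentially with execution length. Then, experiments in synthetic environments and real TextWorld tasks look for performance cliffs and check them against the theoretical predictions.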
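As a worked illustration of what exponential decay implies for chain length, assume a bound of the following form. This summary does not give the exact statement of Theorem A, so the symbols $\Delta_n$ (decision advantage after $n$ steps), $\rho$ (per-step contraction factor), and $\delta$ (smallest usable advantage) are assumptions made for the sketch.

```latex
\Delta_n \le \Delta_0\,\rho^{\,n}, \qquad 0 < \rho < 1,
\qquad\text{so}\qquad
\Delta_n \ge \delta
\;\Longrightarrow\;
n \le \frac{\ln(\Delta_0/\delta)}{\ln(1/\rho)} .
```

For example, with $\Delta_0 = 1$, $\rho = 0.99$, and $\delta = 0.5$, the right-hand side is about $68.97$, so at most 68 steps can be sustained. Under this assumed form, the maximum length grows only logarithmically in the initial advantage $\Delta_0$.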

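Key innovation: The paper's main contribution is identifying the intrinsic stability limit of autoregressive reasoning and introducing the concept of structural governance. Unlike prior work, it focuses on the instability of the autoregressive generation process itself rather than only on task complexity.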

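Key design: The key design elements are: 1) the derivation of Theorem A, which quantifies how decision advantage decays with execution length; 2) the synthetic environment, which controls task complexity so that the intrinsic instability of autoregressive reasoning can be observed in isolation; 3) the choice of TextWorld tasks, which tests the theoretical predictions in a more realistic setting.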
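The idea of a controlled synthetic task can be sketched as a seeded Monte Carlo simulation. This is not the paper's environment: it assumes a linear, unbranched chain in which each step fails independently with probability `step_error`, and all names are hypothetical.

```python
import random


def run_chain(n_steps: int, step_error: float, rng: random.Random) -> bool:
    """Execute one linear, unbranched chain; it fails on the first wrong step."""
    for _ in range(n_steps):
        if rng.random() < step_error:
            return False
    return True


def estimate_success(n_steps: int, step_error: float, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the fraction of chains completed without error."""
    rng = random.Random(seed)
    wins = sum(run_chain(n_steps, step_error, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    # Compare the simulated success rate with the closed form (1 - e)^n.
    for n in (10, 50, 100, 200):
        est = estimate_success(n, step_error=0.01, trials=5000)
        print(n, round(est, 3), round(0.99 ** n, 3))
```

Because the task has no branching and no ambiguity, any loss of success with length in this toy comes from accumulated per-step error alone, which is the kind of isolation the synthetic setting is meant to provide.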

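📊 Experimental Highlights

In the synthetic environment, performance drops in a clear cliff-like fashion as the number of reasoning steps grows, consistent with the prediction of Theorem A. A similar performance cliff appears in the TextWorld tasks, indicating that the intrinsic stability limit of autoregressive reasoning also holds in realistic settings. These results underline the importance of structural governance for long-horizon reasoning.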
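One simple way to locate a cliff in a success-versus-horizon curve is to find the adjacent pair of horizons with the largest drop. The sketch below is an assumption-laden illustration, not the paper's analysis: the curve comes from a toy model with independent per-step errors, and the function names are hypothetical.

```python
def success_curve(step_error: float, horizons: list[int]) -> list[float]:
    """Toy success rate at each horizon: every step must succeed independently."""
    return [(1.0 - step_error) ** n for n in horizons]


def locate_cliff(horizons: list[int], success: list[float]) -> tuple[int, int]:
    """Return the adjacent horizon pair with the largest drop in success rate."""
    if len(horizons) != len(success) or len(horizons) < 2:
        raise ValueError("need two or more horizons with matching success values")
    drops = [success[i] - success[i + 1] for i in range(len(success) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return horizons[i], horizons[i + 1]


if __name__ == "__main__":
    horizons = [2 ** k for k in range(10)]  # 1, 2, 4, ..., 512
    curve = success_curve(0.01, horizons)
    print(locate_cliff(horizons, curve))  # prints (64, 128)
```

In this toy curve the success rate is still above 0.85 at 16 steps, so an evaluation that stops at short horizons would not see the drop between 64 and 128 steps.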

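🎯 Application Scenarios

These findings could be applied to improve LLM performance in long-form text generation, dialogue systems, and robot control. Introducing structured reasoning and discrete segmentation may mitigate long-range dependency problems and improve the coherence and consistency of generated text. Future work could explore more effective structural governance methods, such as hierarchical reasoning or knowledge graph integration.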
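A graph-like execution structure of the kind described above might look like the following minimal sketch, which runs short segments in dependency order and checks each output before it is reused. The segment layout, the `verify` callback, and all names are hypothetical; only `graphlib.TopologicalSorter` is a real standard-library class (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable

# Each segment: (names of prerequisite segments, function of their outputs).
Segment = tuple[tuple[str, ...], Callable[[dict[str, int]], int]]


def run_dag(segments: dict[str, Segment],
            verify: Callable[[str, int], bool]) -> dict[str, int]:
    """Run segments in dependency order, checking each output before reuse."""
    order = TopologicalSorter({name: deps for name, (deps, _) in segments.items()})
    results: dict[str, int] = {}
    for name in order.static_order():
        deps, fn = segments[name]
        value = fn({d: results[d] for d in deps})
        if not verify(name, value):
            raise RuntimeError(f"segment {name!r} failed verification")
        results[name] = value
    return results


if __name__ == "__main__":
    segments: dict[str, Segment] = {
        "load": ((), lambda _: 3),
        "double": (("load",), lambda r: r["load"] * 2),
        "square": (("load",), lambda r: r["load"] ** 2),
        "merge": (("double", "square"), lambda r: r["double"] + r["square"]),
    }
    print(run_dag(segments, verify=lambda name, value: value >= 0))
```

The design choice here is that a segment only ever sees verified outputs of its prerequisites, so an error is caught at a segment boundary instead of propagating through the rest of the execution.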

📄 Abstract (Original)

Large language models (LLMs) demonstrate remarkable reasoning capabilities, yet their performance often deteriorates sharply in long-horizon tasks, exhibiting systematic breakdown beyond certain scales. Conventional explanations primarily attribute this phenomenon to task complexity, such as combinatorial search explosion or long-term credit assignment challenges. In this work, we argue that these explanations are incomplete: even in linear, unbranched tasks without semantic ambiguity, autoregressive execution is subject to an intrinsic stability limit. We propose that the fundamental constraint on long-horizon reasoning arises from process-level instability in autoregressive generation rather than solely from search or task complexity, reframing long-horizon reasoning as a problem of structural governance. We derive Theorem A, showing that decision advantage in single-path autoregressive reasoning decays exponentially with execution length, imposing a fundamental bound on maintainable reasoning chains. This result implies a structural consequence: stable long-horizon reasoning requires discrete segmentation, naturally inducing graph-like execution structures such as directed acyclic graphs (DAGs). Empirical studies in both synthetic environments and real TextWorld tasks reveal observable performance cliffs consistent with theoretical predictions. Our findings provide a dynamical perspective on long-horizon reasoning failure and suggest new limitations on maintaining long-term coherence under purely autoregressive architectures. Furthermore, we highlight that short-horizon evaluation protocols may obscure structural instability, indicating a potential shift from scaling toward structured governance in future reasoning systems.
