MIPT-SSM: Scaling Language Models with $O(1)$ Inference Cache via Phase Transitions

Author: Yasong Fan

Category: cs.LG

Published: 2026-04-09

Comments: 6 pages, 8 tables

💡 One-Sentence Takeaway

Proposes MIPT-SSM to address the inference efficiency of language models.

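🎯 Matched area: Pillar 2: RL Algorithms & Architecture

Keywords: language models, measurement-induced phase transition, inference efficiency, memory optimization, sequence modeling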

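📋 Key Points

Existing language models consume large amounts of memory and run inefficiently at inference time, which limits where they can be applied.
MIPT-SSM introduces a learned measurement rate $p_{t}$ that switches computation between a wave phase and a particle phase, optimizing how information is stored and processed.
In the reported experiments, MIPT outperforms a Transformer baseline on both accuracy (0.905 versus 0.736 on AG News) and memory use (810 MB versus 34,651 MB at $N=8192$).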

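📝 Abstract (Summary)

We present MIPT-SSM, a neural sequence architecture built on the physics of Measurement-Induced Phase Transitions (MIPT). Its core idea is a learned measurement rate $p_{t}\in(0,1)$ that routes computation between a wave phase and a particle phase. The model is predicted to exhibit a phase transition at a critical sequence length $N^{*}\approx1024$, where the information density ratio $N/D$ crosses unity. On the AG News dataset, MIPT reaches 0.905 accuracy versus the Transformer's 0.736, a reported gain of +16.6%. In terms of memory, MIPT needs only 810 MB at $N=8192$ while the Transformer needs 34,651 MB, a 42.8x reduction.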

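🔬 Method Details

Problem definition: Existing language models typically need large amounts of memory at inference time, which makes them inefficient, especially on long sequences. Conventional sequence-modeling approaches are limited in how they store and process information and cope poorly with complex inputs.

Core idea: MIPT-SSM learns a measurement rate $p_{t}$ that dynamically adjusts the computation mode, switching between the wave phase (information propagates as distributed complex-phase interference) and the particle phase (the state concentrates on the current token), which makes information processing more flexible and efficient.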
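The summary does not state the actual update rule, so the following is only a minimal sketch of how a gate $p_{t}$ could interpolate between the two regimes. It assumes a single complex-valued state that is rotated by a fixed phase when $p_{t}$ is small and overwritten by the current input when $p_{t}$ is large. The names `gated_step`, `run`, and `theta` are hypothetical and do not come from the paper.

```python
import cmath


def gated_step(h, x, p, theta):
    """One hypothetical gated update of a complex state h (an assumption, not the paper's rule).

    p -> 0 (wave phase): h is mostly rotated by exp(i*theta), so history is kept.
    p -> 1 (particle phase): h is mostly overwritten by the current input x.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie in the open interval (0, 1)")
    return (1.0 - p) * cmath.exp(1j * theta) * h + p * x


def run(xs, ps, theta=0.1, h0=0j):
    """Apply gated_step over a sequence of inputs xs with per-step gates ps."""
    h = h0
    for x, p in zip(xs, ps):
        h = gated_step(h, x, p, theta)
    return h


if __name__ == "__main__":
    # Gates near 1: the state tracks the most recent input.
    print(run([1 + 0j, 2 + 0j], [0.999, 0.999]))
    # Gate near 0: the initial state is rotated and almost fully preserved.
    print(run([5 + 0j], [0.001], h0=1 + 0j))
```

With gates close to 1 the final state sits near the last input, and with a gate close to 0 the state keeps roughly its original magnitude. This illustrates the intuition only; a single scalar interpolation like this one is not claimed to be the mechanism the paper uses.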

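Technical framework: The overall MIPT-SSM architecture has two main stages: first, it selects the computation mode according to the current measurement rate $p_{t}$; second, it performs precise local storage in the particle phase. The model draws on phase-transition theory to optimize how information is stored and retrieved.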

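Key innovation: The main novelty of MIPT-SSM is bringing in the concept of measurement-induced phase transitions. The wave and particle regimes are provably incompatible in a single linear operator (a "no-go theorem" in sequence modeling), and the learned gate $p_{t}$ works around this by switching dynamically between the two.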

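Key design: The model uses a learned measurement rate $p_{t}$, and the experiments show a phase-transition phenomenon at a specific sequence length. Memory use is also optimized, so inference stays efficient on large inputs.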

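📊 Experimental Highlights

On the AG News dataset, MIPT-SSM reaches 0.905 accuracy versus the Transformer's 0.736, a reported gain of +16.6%. In terms of memory, MIPT needs only 810 MB at sequence length 8192 while the Transformer needs 34,651 MB, a 42.8x reduction. On the exact-recall task, MIPT's causal sparse KV cache reaches 0.968 accuracy, demonstrating strong information-storage capability.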
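To make the idea of a gate-driven sparse cache more concrete, here is a minimal sketch of a bounded store that keeps only the highest-scoring tokens seen so far. The summary does not describe the paper's storage or eviction policy, so everything here is an assumption: `TopKCache`, `offer`, `lookup`, and the use of a per-token `score` as a stand-in for $p_{t}$ are all hypothetical.

```python
import heapq


class TopKCache:
    """Hypothetical bounded cache keeping the `capacity` highest-scoring entries.

    Positions are assumed to be unique integers, which also breaks score ties.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be >= 1")
        self.capacity = capacity
        self._heap = []  # min-heap of (score, position, key, value)

    def offer(self, position, key, value, score):
        """Consider one token; evict the lowest-scoring entry if the cache is full."""
        entry = (score, position, key, value)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)

    def lookup(self, key):
        """Return the value of the most recent stored entry matching key, or None."""
        matches = [(pos, val) for _, pos, k, val in self._heap if k == key]
        return max(matches)[1] if matches else None

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    cache = TopKCache(capacity=4)
    for pos in range(512):
        if pos == 137:  # the single "needle" gets a high gate score
            cache.offer(pos, "needle", "secret", 0.99)
        else:  # noise tokens get a low gate score
            cache.offer(pos, f"noise{pos}", None, 0.01)
    print(cache.lookup("needle"), len(cache))
```

Tokens are offered in sequence order, so a lookup at any step sees only what was stored from earlier positions. Cache size stays fixed no matter how long the sequence grows, which is the property the memory figures above depend on.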

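🎯 Application Scenarios

The results of MIPT-SSM have broad potential in natural language processing, information retrieval, machine translation, and related fields. By reducing memory use and improving inference efficiency, the model can support larger-scale data processing and advance intelligent applications, especially where real-time response is required.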

📄 Abstract (Original)

We present MIPT-SSM, a neural sequence architecture built on the physics of Measurement-Induced Phase Transitions (MIPT). The central idea is a learned measurement rate $p_{t}\in(0,1)$ that routes computation between two regimes: wave phase $(p_{t}\rightarrow0)$, where information propagates as distributed complex-phase interference; and particle phase $(p_{t}\rightarrow1)$, where the state collapses onto the current token, enabling precise local storage. These two regimes are provably incompatible in a single linear operator (one of the few "no-go theorems" in sequence modeling), and $p_{t}$ is our way around it. The model is predicted to exhibit a phase transition at critical sequence length $N^{*}\approx1024$, where the information density ratio $N/D$ crosses unity, consistent with our memory scaling observations. On AG News (four-class classification), MIPT achieves 0.905 accuracy versus Transformer's 0.736 (+16.6%), stable across 3 seeds. At $N=8192$, MIPT requires 810 MB versus Transformer's 34,651 MB, a 42.8x memory reduction. On exact-recall ("needle-in-a-haystack"), our causal sparse KV cache achieves 0.968 accuracy. Remarkably, under unbounded cache capacity, the $p_{t}$ gate autonomously learns to store only the single critical token (averaging $1.0/512$ slots used), filtering out all noise and achieving a 99.8% sparsity rate. On language modeling (WikiText-103, 31M parameters), MIPT-LM with $K=64$ cache reaches PPL 92.1 versus Transformer's 90.5 (gap: 1.8%), while inference KV cache shrinks from $O(N)$ to $O(64)$.
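Several of the derived figures in the abstract can be reproduced from the raw numbers it reports. The short check below is a sketch that uses only values quoted above; the variable names are illustrative.

```python
# Raw figures quoted in the abstract.
transformer_mb = 34_651   # Transformer memory at N = 8192
mipt_mb = 810             # MIPT memory at N = 8192
slots_used = 1.0          # average cache slots used in exact recall
slots_total = 512
mipt_ppl = 92.1           # MIPT-LM perplexity on WikiText-103
transformer_ppl = 90.5    # Transformer perplexity on WikiText-103

memory_reduction = transformer_mb / mipt_mb
sparsity = 1 - slots_used / slots_total
ppl_gap = (mipt_ppl - transformer_ppl) / transformer_ppl

print(f"memory reduction: {memory_reduction:.1f}x")  # 42.8x
print(f"sparsity: {sparsity:.1%}")                   # 99.8%
print(f"perplexity gap: {ppl_gap:.1%}")              # 1.8%
```

The perplexity gap here is taken relative to the Transformer baseline, which matches the 1.8% quoted in the abstract.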
