CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Authors: Xuezhen Xie, Zhiqiang Zhou

Categories: cs.LG, cs.AI

Published: 2026-06-09

Comments: 13 pages, 8 figures, 8 tables

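💡 One-sentence takeaway

Proposes CLP, a lightweight span-length predictor that lets multi-token prediction speed up autoregressive decoding without degrading output quality.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multi-token prediction, autoregressive decoding, language models, inference acceleration, deep learning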

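📋 Key points

In existing multi-token prediction (MTP) methods, the MTP head for the first token competes with the backbone's LM head during autoregressive decoding, which degrades output quality.
The paper proposes the Backbone-as-Architect design principle: the backbone LM head always generates the first token, MTP heads handle only the subsequent tokens, and CLP predicts how many additional tokens to accept.
CLP achieves 1.20x to 1.29x speedup on the 1.5B model and 1.14x to 1.20x on the 7B model with no loss of output quality, whereas prior gate-based approaches either fail to accelerate (1.07x) or produce severely degraded outputs.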

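📝 Abstract (summary)

Large language model inference is bottlenecked by autoregressive decoding, in which every token requires a full forward pass. Multi-token prediction (MTP) offers a potential path to acceleration, but existing methods share a fundamental architectural flaw: the MTP head for the first token competes with the backbone's language model head, so quality drops sharply when its predictions are accepted. To address this, the paper proposes the Backbone-as-Architect design principle, under which the backbone LM head always generates the first token and the MTP heads are responsible only for subsequent tokens. Building on this principle, the authors introduce CLP (Collocation-Length Predictor), a lightweight span-level decision layer that predicts how many additional tokens can be safely accepted at each decoding step. In experiments on Qwen2.5 models, CLP achieves 1.20x to 1.29x speedup on 1.5B and 1.14x to 1.20x on 7B while preserving output quality.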

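🔬 Method details

Problem definition: The paper targets the competition between the first-token MTP head and the backbone language model head in existing multi-token prediction methods. The authors identify this competition as the root cause of repetitive and incoherent outputs in prior MTP-based acceleration.

Core idea: Under the Backbone-as-Architect design principle, the backbone LM head always generates the first token, and the MTP heads generate only the tokens that follow it. Because no MTP head ever predicts the first token, the competition between heads is removed.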
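The division of labour described above can be sketched as a single decoding step. This is a minimal illustration, not the authors' implementation: the function name `accept_span`, its arguments, and the clamping behaviour are all assumptions made for the example.

```python
from typing import List


def accept_span(backbone_token: int, mtp_tokens: List[int], n_extra: int) -> List[int]:
    """Return the tokens emitted in one decoding step (hypothetical helper).

    backbone_token: token chosen by the backbone LM head; always emitted first.
    mtp_tokens:     draft tokens from the MTP heads for the following positions.
    n_extra:        number of MTP tokens to accept, as decided by a
                    span-level predictor such as CLP.
    """
    # Clamp so that an out-of-range prediction never indexes past the drafts.
    n_extra = max(0, min(n_extra, len(mtp_tokens)))
    # The first token never comes from an MTP head, so the heads cannot compete.
    return [backbone_token] + mtp_tokens[:n_extra]


step_a = accept_span(11, [22, 33], n_extra=0)  # backbone token only
step_b = accept_span(11, [22, 33], n_extra=2)  # backbone token plus both drafts
step_c = accept_span(11, [22, 33], n_extra=5)  # over-prediction is clamped
```

With `n_extra=0` the step falls back to ordinary one-token autoregressive decoding, so in this sketch the MTP heads can only extend the backbone's output and never override it.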

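Technical framework: The architecture consists of the backbone language model and the MTP heads, with CLP inserted as a decision layer that predicts, at each decoding step, how many additional tokens can be safely accepted. CLP is a single linear layer, so it adds very few parameters.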
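One way to picture a single-linear-layer span predictor is as a small classifier over candidate span lengths 0..k. The sketch below assumes that formulation; the source does not specify CLP's input features, output classes, or whether it has a bias term, so `clp_predict`, `linear_param_count`, and every dimension used here are illustrative assumptions.

```python
from typing import List


def clp_predict(hidden: List[float], weight: List[List[float]], bias: List[float]) -> int:
    """Score each candidate span length with one linear layer; return the argmax.

    weight has one row per class, where class i means "accept i extra tokens".
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]
    return max(range(len(logits)), key=logits.__getitem__)


def linear_param_count(hidden_dim: int, num_classes: int, with_bias: bool = False) -> int:
    """Number of parameters in a single linear layer of the given shape."""
    return hidden_dim * num_classes + (num_classes if with_bias else 0)


# Toy example with a 4-dimensional feature vector and classes 0, 1, 2.
hidden = [1.0, -2.0, 0.5, 0.0]
weight = [
    [0.1, 0.0, 0.0, 0.0],   # class 0: accept no extra tokens
    [0.0, -1.0, 0.0, 0.0],  # class 1: accept one extra token
    [0.0, 0.0, 1.0, 0.0],   # class 2: accept two extra tokens
]
bias = [0.0, 0.0, 0.0]
n_extra = clp_predict(hidden, weight, bias)

# Illustrative sizes only (assumed, not from the source): a 1536-wide input
# with 3 or 5 output classes lands in the few-thousand-parameter range.
small = linear_param_count(1536, 3)
large = linear_param_count(1536, 5)
```

A layer of this shape stays in the low thousands of parameters for any plausible number of classes, which is the order of magnitude the abstract reports for CLP (4.6K to 7.7K) against roughly 1M for gate networks.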

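Key innovation: CLP itself is the main contribution. It replaces the gate networks used in prior work, which the authors describe as over-engineered, with a much simpler decision layer while preserving output quality.

Key design: CLP uses 4.6K to 7.7K parameters in place of the roughly 1M-parameter gate networks used previously. The paper reports speedups with no quality degradation on the 1.5B and 7B models, and finds that a shorter prediction horizon (k=2) recovers 24% higher MTP head accuracy on large models.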

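📊 Experimental highlights

CLP achieves 1.20x to 1.29x speedup on the 1.5B model and 1.14x to 1.20x on the 7B model, with a reported repetition ratio below 0.02. Gate-based approaches, by contrast, either fail to accelerate meaningfully (1.07x) or produce severely degraded outputs (repetition ratio reported as > 0.5%).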
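To get a feel for what these speedups imply, the following back-of-the-envelope sketch relates speedup to the average number of extra tokens accepted per step. It assumes an idealized cost model that is not taken from the paper: each step costs one backbone forward pass plus a fixed fractional overhead for the MTP heads and predictor, and `ideal_speedup`, `required_extra_tokens`, and `overhead_fraction` are hypothetical names.

```python
def ideal_speedup(mean_extra_tokens: float, overhead_fraction: float = 0.0) -> float:
    """Tokens emitted per unit cost, relative to one-token-per-pass decoding."""
    tokens_per_step = 1.0 + mean_extra_tokens  # backbone token plus accepted drafts
    cost_per_step = 1.0 + overhead_fraction    # one forward pass plus extra work
    return tokens_per_step / cost_per_step


def required_extra_tokens(target_speedup: float, overhead_fraction: float = 0.0) -> float:
    """Invert ideal_speedup: mean extra tokens per step needed for a target."""
    return target_speedup * (1.0 + overhead_fraction) - 1.0


# Under zero overhead, the reported range corresponds to accepting
# roughly 0.14 to 0.29 extra tokens per step on average.
low = required_extra_tokens(1.14)
high = required_extra_tokens(1.29)
```

Under this simplified model the speedup is bounded by how often draft tokens can be accepted, which is consistent with the abstract's statement that MTP head prediction accuracy is the binding constraint on acceleration.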

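🎯 Application scenarios

Potential applications include real-time dialogue systems, text generation, and machine translation. Because CLP raises inference speed while preserving output quality, it could be useful in practical deployments of large language models such as intelligent assistants and automated content generation.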

📄 Abstract (original)

Large language model inference is bottlenecked by autoregressive decoding, where each token requires a full forward pass. Multi-token prediction (MTP) offers a promising acceleration path, but existing approaches suffer from a fundamental architectural flaw: the MTP head for the first token competes with the backbone's own language model (LM) head, leading to severe quality degradation when predictions are accepted. We identify this head-backbone competition as the root cause of repetitive and incoherent outputs in prior MTP-based acceleration methods. To address this, we propose Backbone-as-Architect, a design principle where the backbone LM head always generates the first token, and MTP heads are responsible only for subsequent tokens. Building on this principle, we introduce CLP (Collocation-Length Predictor), a lightweight span-level decision layer that predicts how many additional tokens can be safely accepted at each decoding step. CLP uses only a single linear layer (4.6K--7.7K parameters), replacing the over-engineered 1M-parameter gate networks used in prior work. Experiments on Qwen2.5 models (0.5B, 1.5B, 7B) show that CLP achieves 1.20x--1.29x speedup on 1.5B and 1.14x--1.20x on 7B, with zero quality degradation (repetition ratio < 0.02), while gate-based approaches fail to accelerate (1.07x) or produce severely degraded outputs (repetition ratio > 0.5%). We further demonstrate that shorter prediction horizons (k=2) recover 24% higher MTP head accuracy on large models, establishing a scaling-aware design principle. We identify MTP head prediction accuracy as the binding constraint on acceleration and establish a clear roadmap for future improvements.
