Token-Level LLM Collaboration via FusionRoute

Authors: Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

Categories: cs.AI, cs.CL, cs.LG

Published: 2026-01-08

Comments: 25 pages

💡 One-Sentence Takeaway

Proposes FusionRoute, which improves multi-domain task performance through token-level collaboration among LLMs.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multi-model collaboration, token-level routing, large language models, model fusion, domain expert models

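📋 Key Points

General-purpose LLMs are very large and costly to train and deploy, while domain expert models generalize poorly and are hard to apply across domains.
FusionRoute uses a lightweight router that selects an expert at the token level and generates a complementary logit to correct the expert's output, enabling efficient collaboration.
Experiments show that FusionRoute outperforms existing methods on mathematical reasoning, code generation, and instruction following, while remaining competitive with domain expert models.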

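📝 Abstract (Translated)

Large language models (LLMs) show strengths across different domains. However, achieving strong performance in all of these domains with a single general-purpose model usually requires scaling to sizes that are prohibitively expensive to train and deploy. Smaller domain-specialized models are more efficient, but they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert routing is fundamentally limited: unless strong global coverage assumptions hold, it generally cannot realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and can recover the optimal value function under mild conditions. Across the Llama-3 and Gemma-2 families and a range of benchmarks covering mathematical reasoning, code generation, and instruction following, FusionRoute outperforms sequence-level and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.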

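🔬 Method Details

Problem definition: Existing large language models are either general-purpose models, with huge parameter counts and high training and deployment costs, or domain expert models, which are efficient but generalize poorly and handle cross-domain tasks badly. Existing token-level collaboration methods rely on fixed expert outputs, which limits them in theory: they cannot realize the optimal decoding policy.

Core idea: FusionRoute uses a lightweight router that, at each token's decoding step, dynamically selects the most suitable expert model and also generates a complementary logit vector that corrects the expert's output distribution. This combines the specialization of the experts with the flexibility of the router, so the system adapts better to different tasks and inputs.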

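Technical framework: The overall FusionRoute architecture consists of multiple expert models and one router. At each decoding step, the router receives the representation of the current token and then (1) selects one expert model and (2) generates a complementary logit vector. The final token distribution is obtained by adding the selected expert's logit vector and the complementary logit vector. The router is a lightweight neural network and can be built from structures such as a Transformer or an MLP.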
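The decoding step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `fused_step`, `routing_scores`, and `complementary_logits` are hypothetical names, the router outputs are passed in as plain lists rather than computed by a network, and all experts are assumed to share one vocabulary so that their logits can be added element-wise.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_step(
    expert_logits: Sequence[Sequence[float]],
    routing_scores: Sequence[float],
    complementary_logits: Sequence[float],
) -> Tuple[int, List[float]]:
    """One decoding step: pick an expert, then add the router's logits.

    expert_logits[k]     : expert k's next-token logits (shared vocabulary assumed)
    routing_scores[k]    : router score for expert k (hypothetical router output)
    complementary_logits : router's correction vector (hypothetical router output)
    """
    # (1) Select the expert with the highest routing score.
    k = max(range(len(routing_scores)), key=lambda i: routing_scores[i])
    # (2) Logit addition: selected expert's logits plus the router's correction.
    fused = [z + c for z, c in zip(expert_logits[k], complementary_logits)]
    return k, softmax(fused)


# Toy example: two experts and a three-token vocabulary.
experts = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
chosen, probs = fused_step(
    experts,
    routing_scores=[0.2, 0.8],
    complementary_logits=[0.0, 0.0, 3.0],
)
next_token = max(range(len(probs)), key=lambda i: probs[i])
```

In this toy run the router picks the second expert, and the complementary logits shift the most likely token from the expert's own preference (token 1) to token 2, which is the kind of correction the framework is meant to allow.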

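Key innovation: FusionRoute's main contribution is the complementary logit generator. Unlike conventional token-level collaboration methods, FusionRoute not only selects an expert model but also corrects that expert's output with a generated complementary logit vector. This expands the effective policy space, so the model can better approximate the optimal decoding policy. The paper shows through theoretical analysis that pure expert routing has inherent limitations and that adding the complementary logit generator overcomes them.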
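A single-step toy case gives some intuition for this claim. It is an illustration only, not the paper's proof: the expert logits and the target distribution are made up, and `correction_for` is a hypothetical helper. With expert-only routing, the next-token distribution must be one of the experts' own distributions. With an unconstrained complementary vector c, any strictly positive target p is reachable, because softmax(z + c) = p whenever c = log p - z.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def correction_for(target_probs: Sequence[float],
                   expert_logits: Sequence[float]) -> List[float]:
    """Complementary logits c such that softmax(expert_logits + c) == target_probs.

    Requires every target probability to be strictly positive.
    """
    return [math.log(p) - z for p, z in zip(target_probs, expert_logits)]


# Made-up setup: each expert strongly prefers a different wrong token,
# while the desired distribution puts most of its mass on token 2.
experts = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
target = [0.1, 0.1, 0.8]

# Expert-only routing: whichever expert is selected, the top token is 0 or 1.
expert_only_top = [max(range(3), key=lambda i: z[i]) for z in experts]

# Routing plus logit addition: correct expert 0 so the fused output equals the target.
c = correction_for(target, experts[0])
fused = softmax([z + ci for z, ci in zip(experts[0], c)])
```

Here `expert_only_top` never contains token 2, whereas `fused` matches `target` up to floating-point error. In practice the correction comes from a trained router rather than being solved for directly, so how closely it can approach the ideal depends on training.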

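Key design: The router is a small Transformer network whose input is the embedding of the current token and whose output is the complementary logit vector. The loss function has two parts: a standard cross-entropy term used to train the expert models and the router, and a regularization term that constrains the magnitude of the complementary logit vector so that the router does not over-correct the expert's output. Expert selection can use methods such as top-k selection or Gumbel-softmax.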
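A two-part loss of this kind might look like the sketch below. The summary does not give the exact regularizer or its weight, so the mean squared magnitude penalty, the weight `lam`, and the function name `fusion_loss` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def fusion_loss(
    expert_logits: Sequence[float],
    complementary_logits: Sequence[float],
    target_id: int,
    lam: float = 0.01,
) -> float:
    """Cross-entropy on the fused logits plus a magnitude penalty.

    The penalty form (mean of squared complementary logits) and the
    weight `lam` are assumptions; the source only says that a
    regularizer limits the magnitude of the correction.
    """
    fused = [z + c for z, c in zip(expert_logits, complementary_logits)]
    cross_entropy = -log_softmax(fused)[target_id]
    penalty = sum(c * c for c in complementary_logits) / len(complementary_logits)
    return cross_entropy + lam * penalty


# Toy example: the selected expert prefers token 0, but the label is token 2.
expert = [2.0, 0.0, 0.0]
no_fix = fusion_loss(expert, [0.0, 0.0, 0.0], target_id=2)
with_fix = fusion_loss(expert, [0.0, 0.0, 3.0], target_id=2)
heavy_reg = fusion_loss(expert, [0.0, 0.0, 3.0], target_id=2, lam=1.0)
```

With a small `lam` the correction lowers the loss (`with_fix < no_fix`), while a large `lam` makes the same correction cost more than it gains (`heavy_reg > no_fix`). This is the trade-off the regularization term is described as controlling.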

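📊 Experimental Highlights

Experiments on the Llama-3 and Gemma-2 model families show that FusionRoute outperforms sequence-level and token-level collaboration, model merging, and direct fine-tuning on mathematical reasoning, code generation, and instruction following. For example, on some tasks FusionRoute improves performance by more than 10% while remaining competitive with domain expert models.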

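🎯 Application Scenarios

FusionRoute can be applied to tasks that require knowledge from multiple domains, such as intelligent customer service, multilingual translation, and code generation. By integrating expert models from different domains, it can give more complete and more accurate answers and generations. The method can also be used for model compression and acceleration: selecting a suitable expert model reduces computation and memory use.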

📄 Abstract (Original)

Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.
