TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

📄 arXiv: 2605.10344v1 📥 PDF

Authors: George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai

Categories: cs.AI

Published: 2026-05-11

🔗 Code/Project: https://github.com/george-QF/TMAS-code


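💡 One-Sentence Takeaway

Proposes TMAS, a framework that scales test-time compute through multi-agent collaboration and a hierarchical memory mechanism.

🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: large language models, test-time compute, multi-agent collaboration, reinforcement learning, hierarchical memory, chain-of-thought optimization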

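📋 Key Points

  1. Existing test-time scaling methods coordinate parallel reasoning trajectories only weakly and do not explicitly decide which historical information to retain and reuse, which leads to redundant reasoning and inefficient exploration.
  2. TMAS uses a multi-agent collaboration framework with a hierarchical memory mechanism (an experience bank and a guideline bank) to enable structured information flow and strategy refinement across trajectories and iterations.
  3. Experiments on complex reasoning tasks show that TMAS achieves stronger iterative scaling than existing baselines, and that hybrid reward training further improves scaling effectiveness and stability across iterations.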

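📝 Abstract (Summary)

Test-time scaling has become an important paradigm for improving the reasoning ability of large language models. Existing methods fall short in coordinating multi-trajectory reasoning, refinement rounds, and verification feedback, and they lack an effective way to filter and reuse historical information, which makes it hard to balance exploration and exploitation. This paper proposes TMAS, a framework that organizes inference as collaboration among specialized agents. TMAS introduces hierarchical memories: the experience bank reuses reliable intermediate conclusions and local feedback, while the guideline bank records high-level strategies to avoid redundant reasoning. The paper also designs a hybrid reward reinforcement learning scheme that aims to preserve basic reasoning capability while improving experience utilization and encouraging strategy exploration. On several challenging reasoning benchmarks, TMAS shows stronger iterative scaling and training stability than existing baselines.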

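🔬 Method Details

Problem definition: The paper addresses three problems that arise when scaling compute at test time: multi-trajectory reasoning lacks deep coordination, historical information is poorly utilized, and exploration is hard to balance against exploitation.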

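Core idea: Model inference as a multi-agent collaboration task. An explicit memory management mechanism stores low-level conclusions separately from high-level strategies, so that reasoning trajectories can be refined in a structured way.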

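Technical framework: TMAS builds a multi-agent collaborative system made up of agents that carry out the actual reasoning and modules that manage memory. Information is exchanged through hierarchical memories: the Experience Bank stores reliable intermediate conclusions, and the Guideline Bank records high-level reasoning strategies that guide subsequent reasoning paths.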
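To make the two memory levels concrete, here is a minimal sketch of how an experience bank and a guideline bank might be kept apart and then rendered into a prompt context. It is an illustration only, not the authors' implementation: the class `HierarchicalMemory`, its method names, the `verified` flag, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    """Hypothetical two-level memory: low-level experiences, high-level guidelines."""

    experiences: list[str] = field(default_factory=list)  # reliable intermediate conclusions
    guidelines: list[str] = field(default_factory=list)   # strategies already explored

    def add_experience(self, conclusion: str, verified: bool) -> None:
        # Assumption: only conclusions that passed some verification step are kept.
        if verified and conclusion not in self.experiences:
            self.experiences.append(conclusion)

    def add_guideline(self, strategy: str) -> None:
        # Record each high-level strategy once so later rollouts can avoid it.
        if strategy not in self.guidelines:
            self.guidelines.append(strategy)

    def build_context(self) -> str:
        # Render both banks as plain text that could be prepended to a prompt.
        lines = ["Reliable intermediate results:"]
        lines += [f"- {e}" for e in self.experiences]
        lines.append("Strategies already tried (avoid repeating):")
        lines += [f"- {g}" for g in self.guidelines]
        return "\n".join(lines)


memory = HierarchicalMemory()
memory.add_experience("n must be even", verified=True)
memory.add_experience("n is prime", verified=False)  # dropped: not verified
memory.add_guideline("induction on n")
print(memory.build_context())
```

Keeping the two banks as separate fields mirrors the decoupling described above: one holds facts to reuse, the other holds approaches to steer away from.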

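Key innovation: The hierarchical memory architecture shifts inference from "blind search" to "strategy-guided" search. Compared with prior methods, TMAS not only reuses local results but also uses the guideline bank to explicitly steer away from reasoning patterns that have already been explored, which makes exploration of the search space more efficient.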
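One way to picture steering rollouts away from already explored strategies is a novelty filter over candidate strategies. The sketch below is an assumption made for illustration: the source does not say how TMAS compares strategies, so token-level Jaccard similarity, the function names `jaccard` and `filter_novel`, and the `threshold` value are all hypothetical stand-ins.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two strategy descriptions (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_novel(candidates: list[str], explored: list[str],
                 threshold: float = 0.6) -> list[str]:
    """Keep candidates that differ enough from explored and already kept strategies."""
    novel: list[str] = []
    for cand in candidates:
        # Compare against the guideline bank and against candidates accepted so far.
        if all(jaccard(cand, seen) < threshold for seen in explored + novel):
            novel.append(cand)
    return novel


explored = ["induction on n"]
candidates = ["Induction on N", "use generating functions", "use generating functions"]
print(filter_novel(candidates, explored))
```

A learned or model-judged similarity would likely replace the token overlap in practice; the point here is only the control flow of rejecting near-duplicates of recorded strategies.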

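Key design: A hybrid reward reinforcement learning (Hybrid Reward RL) scheme combines a reward for basic reasoning accuracy, a reward for efficient experience utilization, and a reward for strategy exploration. This multi-objective optimization is meant to keep the model logically rigorous across iterations while it continues to discover new solution paths.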
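The three reward terms described above could be combined in many ways; the source does not give the formula. The sketch below assumes a simple weighted sum, and the weights, the `RewardWeights` class, the `hybrid_reward` function, and the way each term is measured are hypothetical choices made only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the paper's actual values are not stated in the source.
    correctness: float = 1.0
    experience: float = 0.3
    exploration: float = 0.3


def hybrid_reward(is_correct: bool, used_experiences: int, offered_experiences: int,
                  is_new_strategy: bool,
                  weights: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of accuracy, experience-utilization, and exploration terms."""
    r_correct = 1.0 if is_correct else 0.0
    # Fraction of offered experiences the rollout actually used (0 if none offered).
    r_experience = used_experiences / offered_experiences if offered_experiences > 0 else 0.0
    # Bonus when the rollout's strategy is not already in the guideline bank.
    r_explore = 1.0 if is_new_strategy else 0.0
    return (weights.correctness * r_correct
            + weights.experience * r_experience
            + weights.exploration * r_explore)


# A correct rollout that used 2 of 4 offered experiences with a known strategy.
print(hybrid_reward(True, used_experiences=2, offered_experiences=4, is_new_strategy=False))
```

Keeping the accuracy term dominant reflects the stated goal of preserving basic reasoning capability while the smaller terms nudge experience reuse and exploration.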

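📊 Experimental Highlights

Across several mainstream complex reasoning benchmarks, TMAS is reported to deliver clear performance gains. The results indicate that TMAS scales more efficiently over iterations than existing chain-of-thought (CoT) and multi-trajectory search baselines. With hybrid reward training, the model is more stable across multiple rounds of reasoning and suffers less performance degradation on long reasoning chains, which supports the effectiveness of hierarchical memory in improving reasoning quality.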

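🎯 Application Scenarios

The work applies to complex reasoning scenarios that demand logical rigor, such as mathematical proof, code generation, hypothesis derivation in scientific research, and complex decision planning. By making test-time compute more efficient, TMAS can improve the accuracy and robustness of large models on long-chain, multi-step tasks, which is relevant to deploying AI in specialized domains.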

📄 Abstract (Original)

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks demonstrate that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, while hybrid reward training further improves scaling effectiveness and stability across iterations. Code and data are available at https://github.com/george-QF/TMAS-code.