A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
Authors: Qianben Chen, Jingyi Cao, Jiayu Zhang, Tianrui Qin, Xiaowan Li, King Zhu, Dingfeng Shi, He Zhu, Minghao Liu, Xiaobo Liang, Xin Gui, Ge Zhang, Jian Yang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Categories: cs.CL, cs.AI
Published: 2025-10-13 (updated: 2025-10-21)
Comments: 12 pages, 6 figures
💡 One-Sentence Takeaway
Proposes A$^2$FM, an adaptive foundation model that fixes the inefficiency of always reasoning or always calling tools by routing each query to the cheapest sufficient mode.
🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: adaptive models, reasoning optimization, tool calling, cost efficiency, task awareness
📋 Key Points
- Existing reasoning-centric and agentic large language models diverge sharply in efficiency and accuracy, leading to overthinking or excessive tool calls on simple queries.
- The proposed A$^2$FM follows a route-then-align principle: it first performs task-aware routing, then aligns mode-specific trajectories under a shared backbone.
- At the 32B scale, A$^2$FM delivers significant gains across multiple benchmarks and cuts the cost per correct answer by up to 45.2% relative to existing methods.
📝 Abstract (Translated)
Large language models fall into two families: reasoning-centric LLMs and agentic LLMs. The former strengthen internal reasoning but cannot invoke external tools; the latter can interact with environments but lag in deep reasoning. This paper proposes the Adaptive Agent Foundation Model (A$^2$FM), which closes the efficiency gap between the two through task-aware routing and mode-specific trajectory alignment. A newly introduced instant mode handles simple queries directly, avoiding unnecessary reasoning or tool calls. Experiments show that A$^2$FM sets a new SOTA on several benchmarks while being substantially more cost-efficient than existing methods.
🔬 Method Details
Problem definition: This paper targets the inefficiency of reasoning-centric and agentic large language models on simple queries. Existing methods tend to overthink or over-call tools on easy tasks, wasting compute.
Core idea: A$^2$FM introduces task-aware routing and an instant mode, so the model responds with only as much reasoning or tool use as each task requires.
Technical framework: A$^2$FM comprises three modes: reasoning, agentic, and instant. The model first routes each query by task type, then aligns the mode-specific trajectories under a shared backbone.
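The route-then-align idea can be sketched as a mode selector. In A$^2$FM the routing decision is learned end-to-end; the `needs_tools`/`complexity` signals and the 0.3 threshold below are purely illustrative stand-ins for the learned policy:

```python
from enum import Enum

class Mode(Enum):
    INSTANT = "instant"      # answer directly: no chain-of-thought, no tools
    REASONING = "reasoning"  # internal chain-of-thought, no tool calls
    AGENTIC = "agentic"      # interact with the environment / call tools

def route(query: str, needs_tools: bool, complexity: float) -> Mode:
    """Stand-in for the learned task-aware router (illustrative only)."""
    if needs_tools:
        return Mode.AGENTIC
    if complexity < 0.3:  # simple query: answer immediately in instant mode
        return Mode.INSTANT
    return Mode.REASONING
```

Whichever mode is chosen, its trajectory is generated by the same shared backbone, which is what makes the subsequent mode alignment possible.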
Key innovation: The central novelty is the instant mode, which answers simple queries directly, markedly improving efficiency and cutting cost. What distinguishes this design from prior work is its dynamic adaptivity and task-oriented routing.
Key design: Training uses Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward, ensuring efficient execution across task types.
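A minimal sketch of a cost-regularized reward in the spirit of APO; the paper's exact formula is not reproduced here, so the linear penalty and the weight `lam` are assumptions for illustration:

```python
def apo_reward(correct: bool, cost: float, lam: float = 0.5) -> float:
    """Illustrative cost-regularized reward: reward correctness, then
    subtract a cost penalty so cheaper correct trajectories score higher."""
    base = 1.0 if correct else 0.0
    return base - lam * cost
```

Under such a reward, a correct instant-mode answer outranks an equally correct but expensive reasoning or agentic trajectory, which is exactly the pressure that teaches the router to prefer the cheapest sufficient mode.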
📊 Experimental Highlights
At the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE. Its adaptive execution costs only $0.00487 per correct answer, improving cost efficiency by 45.2% over the reasoning mode and 33.5% over the agentic mode.
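The reported ratios can be sanity-checked with a little arithmetic, backing out the per-correct-answer costs implied for the pure reasoning and agentic baselines (the paper reports only the adaptive cost and the two savings percentages):

```python
# Back out the baseline costs implied by the reported savings.
adaptive = 0.00487                    # reported cost per correct answer
reasoning = adaptive / (1 - 0.452)    # adaptive is 45.2% cheaper than reasoning
agentic = adaptive / (1 - 0.335)      # adaptive is 33.5% cheaper than agentic
print(f"implied reasoning cost ~ ${reasoning:.5f}, agentic ~ ${agentic:.5f}")
```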
🎯 Application Scenarios
A$^2$FM has broad application potential, especially where efficient reasoning and tool calling both matter, such as intelligent assistants, automated question answering, and complex task processing. Its cost efficiency and accuracy make it valuable in both commercial and research settings, and it may drive further intelligent applications.
📄 Abstract (Original)
Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.