Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Authors: Anuj Sadani, Deepak Kumar

Category: cs.AI

Published: 2026-04-23

Comments: 21 pages

🔗 Code/Project: GitHub (https://github.com/asadani/tool-attention)

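💡 One-Sentence Takeaway

Proposes Tool Attention, a mechanism that uses dynamic tool gating and lazy schema loading to eliminate the MCP/Tools Tax in scalable agentic workflows.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: agent systems, Tool Attention, Model Context Protocol, lazy schema loading, dynamic tool gating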

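📋 Key Points

Under MCP as commonly deployed, tool schemas are loaded eagerly on every agent-tool interaction, which incurs substantial token overhead and hurts efficiency and reasoning quality.
Tool Attention loads tool schemas selectively through intent matching, state-aware gating, and lazy loading, cutting unnecessary token consumption.
In simulation, Tool Attention sharply reduces per-turn token counts and raises context utilization; the gains in task success rate and reasoning quality are projections rather than measurements.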

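📝 Abstract (Translated)

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead known as the MCP Tax or Tools Tax. Practitioners report that this overhead runs from roughly 10k to 60k tokens in typical multi-server deployments. The payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. The authors introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score computed from sentence embeddings, (ii) a state-aware gating function that enforces preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for the top-k gated tools. The evaluation uses a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and projected values are marked explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems.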

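🔬 Method Details

Problem definition: The paper targets the "Tools Tax" that arises from inefficiencies in the Model Context Protocol (MCP) when LLM agents interact with external tools. Existing approaches, MCP-based ones in particular, typically inject schemas statelessly and eagerly, so every turn reloads the schemas of a large number of tools. This wastes tokens, adds computational load, and may degrade reasoning quality. The Tools Tax thereby becomes a bottleneck for scaling agent systems.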

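Core idea: The paper extends attention from the token level to the tool level, using the Tool Attention mechanism to select and load tool schemas dynamically. The mechanism ranks and filters the available tools according to the agent's intent and current state and loads only the most relevant schemas, avoiding unnecessary token consumption. The approach borrows the idea behind "Attention Is All You Need": concentrate attention on the information that matters.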

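Technical framework: Tool Attention consists of three main modules. 1) Intent Schema Overlap (ISO) scoring: sentence embeddings are used to compute the similarity between the agent's intent and each tool schema, giving an initial estimate of tool relevance. 2) State-aware gating function: tools are gated on the agent's current state (for example, actions already executed and available resources) together with each tool's preconditions and access scope, which excludes tools that do not apply. 3) Two-phase lazy schema loader: a compact pool of tool summaries is kept in context, and full JSON schemas are loaded only when needed. First, the top-k most relevant tools are selected from the ISO scores and the gating function; then full JSON schemas are loaded for those tools only.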
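The scoring, gating, and top-k steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Tool` fields, the `gate` rule, and the function names are hypothetical, and a toy bag-of-words vector stands in for the sentence-embedding model so that the example runs with the standard library alone.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    summary: str                       # compact description kept in context
    requires: frozenset = frozenset()  # hypothetical precondition flags
    scope: str = "public"              # hypothetical access scope


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gate(tool: Tool, done: set, scopes: set) -> bool:
    """State-aware gate: all preconditions met and the tool's scope granted."""
    return tool.requires <= done and tool.scope in scopes


def select_tools(intent: str, tools, done: set, scopes: set, k: int = 2):
    """Score gated tools against the intent and keep the top k with a positive score."""
    query = embed(intent)
    scored = [(cosine(query, embed(t.summary)), t.name)
              for t in tools if gate(t, done, scopes)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    return [name for score, name in scored[:k] if score > 0]


tools = [
    Tool("search_issues", "search issues in a repository"),
    Tool("merge_pull_request", "merge an open pull request",
         requires=frozenset({"review_approved"}), scope="repo:write"),
    Tool("send_email", "send an email message"),
    Tool("create_issue", "create a new issue in a repository", scope="repo:write"),
]

selected = select_tools("create an issue in the repository", tools,
                        done=set(), scopes={"public", "repo:write"}, k=2)
print(selected)
```

In this toy run `merge_pull_request` is excluded by the gate before scoring because its precondition has not been met, so only the remaining tools compete for the k slots. In practice the `embed` function would be replaced by a real sentence-embedding model.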

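Key innovation: Tool Attention applies attention to tool selection, which enables dynamic tool gating and lazy schema loading. Compared with conventional eager loading, it substantially reduces the tokens needed per turn, raises context utilization, and lowers compute cost. The state-aware gating function additionally ensures that only applicable tools are exposed, which improves agent reliability.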

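Key design: ISO scoring uses a pretrained sentence-embedding model (for example, Sentence-BERT) to compute the similarity between intent and schema. The state-aware gating function can be customized per application; gating rules can be based on user permissions, resource availability, and similar factors. The two-phase lazy schema loader uses an LRU cache to manage the tool summary pool and the full JSON schemas. The value of k in top-k selection can be tuned to balance accuracy against efficiency.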
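One way to picture the two-phase loader with an LRU cache is the sketch below. It is an illustration under stated assumptions rather than the paper's code: the class name `LazySchemaLoader`, the `fetch_schema` callback, and the choice to keep summaries always resident while caching only full schemas are all hypothetical.

```python
import json
from collections import OrderedDict


class LazySchemaLoader:
    """Phase 1 keeps one-line summaries in context; phase 2 promotes full schemas on demand."""

    def __init__(self, summaries, fetch_schema, capacity=2):
        self.summaries = dict(summaries)   # name -> compact summary (always in context)
        self.fetch_schema = fetch_schema   # hypothetical callable: name -> schema dict
        self.capacity = capacity           # max number of full schemas kept cached
        self._cache = OrderedDict()        # LRU order: least recently used first
        self.fetches = 0                   # counts cache misses, for illustration

    def summary_pool(self) -> str:
        return "\n".join(f"{name}: {text}" for name, text in sorted(self.summaries.items()))

    def promote(self, name):
        """Return the full schema, fetching it only on a cache miss."""
        if name in self._cache:
            self._cache.move_to_end(name)      # mark as most recently used
            return self._cache[name]
        schema = self.fetch_schema(name)
        self.fetches += 1
        self._cache[name] = schema
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)    # evict the least recently used schema
        return schema

    def build_context(self, top_k_names) -> str:
        """Summary pool plus full JSON schemas for the selected tools only."""
        schemas = [json.dumps(self.promote(n), sort_keys=True) for n in top_k_names]
        return self.summary_pool() + "\n\n" + "\n".join(schemas)


def fake_fetch(name):
    # Placeholder for a call to a tool server; the schema shape is made up.
    return {"name": name, "parameters": {"type": "object"}}


loader = LazySchemaLoader(
    {"create_issue": "create a new issue",
     "search_issues": "search issues",
     "send_email": "send an email"},
    fake_fetch,
    capacity=2,
)
loader.promote("create_issue")
loader.promote("search_issues")
loader.promote("create_issue")   # cache hit, no fetch
loader.promote("send_email")     # miss; evicts search_issues
print(loader.fetches, list(loader._cache))
```

The context built for a turn then contains every summary but only the promoted schemas, which is where the token saving comes from. The cache capacity and the value of k are tuning knobs; the values here are arbitrary.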

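📊 Experimental Highlights

On the simulated 120-tool, six-server benchmark, Tool Attention cuts per-turn tool tokens from 47.3k to 2.4k, a 95.0% reduction, and raises effective context utilization from 24% to 91%. These token-level figures are measured in simulation; the resulting improvements in agent performance and efficiency are projected values.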
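As a quick sanity check on the headline number, the reduction can be recomputed from the rounded token counts quoted above. The 20-turn session length is a hypothetical value chosen only to illustrate how a per-turn overhead accumulates.

```python
baseline_tokens = 47_300  # reported per-turn tool tokens with eager loading (rounded)
gated_tokens = 2_400      # reported per-turn tool tokens with Tool Attention (rounded)

saved_per_turn = baseline_tokens - gated_tokens
reduction_pct = 100 * saved_per_turn / baseline_tokens
print(f"saved per turn: {saved_per_turn} tokens ({reduction_pct:.1f}%)")

turns = 20  # hypothetical session length, not from the paper
print(f"saved over {turns} turns: {saved_per_turn * turns} tokens")
```

The rounded inputs give about 94.9%, which agrees with the reported 95.0% if that figure was computed from unrounded token counts (an assumption; the summary does not say).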

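🎯 Application Scenarios

Tool Attention applies broadly to settings where an agent must interact with a large number of external tools, such as intelligent customer service, automated workflow management, and smart-home control. By lowering token consumption and raising context utilization, it can improve the scalability and efficiency of agent systems, reduce operating cost, and improve the user experience. The work offers a new direction for building more efficient and more capable agent systems.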

📄 Abstract (Original)

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead, the MCP Tax or Tools Tax, that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention
