Integrating External Tools with Large Language Models to Improve Accuracy

Authors: Nripesh Niketan, Hadj Batatia

Categories: cs.CL, cs.AI, cs.LG

Published: 2025-07-09

Comments: 9 pages, 3 figures, 2 tables. Extended version of paper published in Proceedings of International Conference on Information Technology and Applications, Springer Nature Singapore, 2025, pp. 409-421. This version includes additional experimental results comparing against GPT-4o, LLaMA-Large, Mistral-Large, and Phi-Large, expanded evaluation methodology, and enhanced analysis.

Journal: Proceedings of International Conference on Information Technology and Applications, Springer Nature Singapore, 2025, pp. 409-421

DOI: 10.1007/978-981-96-1758-6_34

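💡 One-Sentence Summary

Proposes the Athena framework, which integrates external tools to substantially improve LLM question-answering accuracy in educational settings.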

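🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, external tool integration, educational applications, knowledge augmentation, multi-modal language understanding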

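📋 Key Points

Without contextual information, existing LLMs tend to hallucinate or give low-quality responses, which limits their use in education and similar domains.
The Athena framework integrates external APIs and tools (such as calculators and calendars) to give the LLM real-time data and computational capabilities, strengthening its reasoning.
Experiments show that Athena substantially outperforms advanced models such as GPT-4o and LLaMA-Large on mathematical and scientific reasoning tasks, reaching 83% and 88% accuracy respectively.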

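📝 Abstract (translated from Chinese)

This paper aims to improve the querying of large language models (LLMs). It is well known that, without relevant contextual information, LLMs may give poor-quality responses or hallucinate. To address this, several studies have proposed integrating LLMs with external tools that supply up-to-date data and thereby improve accuracy. This paper proposes a framework for integrating external tools to enhance the ability of LLMs to answer questions in educational settings. Specifically, the authors develop a framework that allows access to external APIs to request additional relevant information. Integrated tools can also provide computational capabilities, such as calculators or calendars. The framework was evaluated on the Multi-Modal Language Understanding (MMLU) datasets, with questions covering mathematical and scientific reasoning. The results show that the approach significantly improves performance compared with state-of-the-art language models. The Athena framework achieves 83% accuracy in mathematical reasoning and 88% in scientific reasoning, substantially outperforming all tested models, including GPT-4o, LLaMA-Large, Mistral-Large, Phi-Large, and GPT-3.5; the best baseline model (LLaMA-Large) reaches only 67% and 79% respectively. These promising results pave the way for building complex computing ecosystems around LLMs, making their use more natural in support of various tasks and activities.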

🔬 Method Details

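Problem definition: The paper addresses the low answer accuracy of large language models (LLMs) in educational settings, which stems from missing contextual information and limited computational ability. Existing approaches usually rely on the LLM's own stored knowledge, so they struggle with questions that need real-time information or complex calculation, and they are prone to hallucination or wrong answers.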

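Core idea: The paper integrates the LLM with external tools, building a framework that can access external APIs and use external computational resources. The LLM can then retrieve up-to-date information and carry out complex calculations, which improves the accuracy and reliability of its answers. In effect, the approach gives the LLM the ability to use tools so that it can complete tasks more effectively.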

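Technical framework: The Athena framework comprises four main modules: 1) Question parsing module: receives the user's question and analyzes its type and the information it requires. 2) Tool selection module: chooses a suitable external tool or API based on the question type, for example a calculator API for a math question or a search-engine API for a question that needs real-time information. 3) API call module: calls the selected API to retrieve the relevant information or perform the calculation. 4) Answer generation module: combines the API result with the original question to produce the final answer. The overall process is iterative: the LLM can call external tools as many times as needed until it obtains a satisfactory answer.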
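The four-stage flow described above (parse, select a tool, call it, generate an answer) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `select_tool`, `calculator`, `calendar_tool`, `fake_llm`, and `answer` are hypothetical, parsing and tool selection are folded into one keyword/regex rule rather than a learned component, the LLM is replaced by a stub, and only a single tool call is made instead of the iterative loop.

```python
import ast
import operator
import re
from datetime import date
from typing import Optional, Tuple

# Arithmetic operators the hypothetical calculator tool accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic by walking the AST (no eval())."""
    def evaluate(node: ast.AST):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return str(evaluate(ast.parse(expression.strip(), mode="eval").body))


def calendar_tool(_query: str) -> str:
    """Return today's date in ISO format."""
    return date.today().isoformat()


TOOLS = {"calculator": calculator, "calendar": calendar_tool}

# Matches a run of digits, whitespace, operators, and parentheses.
_EXPR = re.compile(r"[\d(][\d\s.+\-*/()]*[\d)]")


def select_tool(question: str) -> Optional[Tuple[str, str]]:
    """Steps 1-2 (assumed rule-based): pick a tool and the input to send it."""
    match = _EXPR.search(question)
    if match and any(op in match.group() for op in "+-*/"):
        return "calculator", match.group()
    if re.search(r"\b(today|date)\b", question.lower()):
        return "calendar", question
    return None


def fake_llm(question: str, context: Optional[str]) -> str:
    """Stub standing in for the LLM in step 4; a real system would call a model."""
    if context is None:
        return f"Answer to: {question}"
    return f"Using tool output ({context}), answer to: {question}"


def answer(question: str) -> str:
    """Run one pass of the pipeline; the described framework may loop instead."""
    choice = select_tool(question)
    if choice is None:
        return fake_llm(question, None)
    name, tool_input = choice
    observation = TOOLS[name](tool_input)  # step 3: call the selected tool
    return fake_llm(question, f"{name} returned {observation}")


if __name__ == "__main__":
    print(answer("What is 12 * (3 + 4)?"))
    print(answer("What is today's date?"))
```

The calculator walks the parsed expression tree instead of calling `eval`, so the sketch only ever executes the arithmetic operators it explicitly lists.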

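Key innovation: The paper deeply integrates the LLM with external tools in a general framework that can be extended flexibly to different educational scenarios and tasks. Compared with earlier work, the framework can use external computational resources as well as external information, which significantly improves the LLM's reasoning and problem-solving ability. It is also designed for extensibility, so new tools and APIs can be added easily.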

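Key design: The paper does not describe technical details such as parameter settings, loss functions, or network architecture. It can be inferred, however, that the tool selection module may use some form of classifier or decision tree to choose a tool based on the question type. The API call module has to handle the interfaces and data formats of different APIs and may need data conversion and adaptation. The answer generation module may use some form of natural language generation model to combine the API result with the original question into a fluent, natural answer.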

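📊 Experimental Highlights

On the MMLU dataset, the Athena framework substantially outperforms the other advanced models. In mathematical reasoning, Athena reaches 83% accuracy, while the best baseline model, LLaMA-Large, reaches only 67%. In scientific reasoning, Athena reaches 88% accuracy, while LLaMA-Large reaches only 79%. These results indicate that integrating external tools can significantly improve LLM performance.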
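To make the size of the reported gap concrete, the short sketch below computes the absolute gain (in percentage points) and the relative gain over the LLaMA-Large baseline, using only the four accuracy figures quoted above. The helper name `gains` and the dictionary layout are illustrative assumptions.

```python
# Accuracy figures (in %) as reported in the summary above.
REPORTED = {
    "math": {"Athena": 83, "LLaMA-Large": 67},
    "science": {"Athena": 88, "LLaMA-Large": 79},
}


def gains(athena: float, baseline: float) -> tuple:
    """Return (absolute gain in points, relative gain in % of the baseline)."""
    absolute = athena - baseline
    relative = round(100 * absolute / baseline, 1)
    return absolute, relative


for task, scores in REPORTED.items():
    absolute, relative = gains(scores["Athena"], scores["LLaMA-Large"])
    print(f"{task}: +{absolute} points ({relative}% relative)")
# math: +16 points (23.9% relative)
# science: +9 points (11.4% relative)
```

The margin over the best baseline is therefore larger on mathematical reasoning (16 points) than on scientific reasoning (9 points).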

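🎯 Application Scenarios

The results are broadly applicable to online education, intelligent tutoring, and intelligent question-answering systems. With external tools integrated, an LLM can answer students' questions more accurately and reliably and provide personalized learning support. In the future, the framework could also be extended to other fields, such as medical diagnosis and financial analysis, to give professionals stronger decision-support tools.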

📄 Abstract (original)

This paper deals with improving querying large language models (LLMs). It is well-known that without relevant contextual information, LLMs can provide poor quality responses or tend to hallucinate. Several initiatives have proposed integrating LLMs with external tools to provide them with up-to-date data to improve accuracy. In this paper, we propose a framework to integrate external tools to enhance the capabilities of LLMs in answering queries in educational settings. Precisely, we develop a framework that allows accessing external APIs to request additional relevant information. Integrated tools can also provide computational capabilities such as calculators or calendars. The proposed framework has been evaluated using datasets from the Multi-Modal Language Understanding (MMLU) collection. The data consists of questions on mathematical and scientific reasoning. Results compared to state-of-the-art language models show that the proposed approach significantly improves performance. Our Athena framework achieves 83% accuracy in mathematical reasoning and 88% in scientific reasoning, substantially outperforming all tested models including GPT-4o, LLaMA-Large, Mistral-Large, Phi-Large, and GPT-3.5, with the best baseline model (LLaMA-Large) achieving only 67% and 79% respectively. These promising results open the way to creating complex computing ecosystems around LLMs to make their use more natural to support various tasks and activities.
