BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

作者: Xin Gao, Ruiyi Zhang, Meixi Du, Peijia Qin, Pengtao Xie

分类: cs.CL

发布日期: 2026-05-07

备注: Published at ACL 2026; Code and data available at https://github.com/gxx27/BioTool

🔗 代码/项目: GITHUB

💡 一句话要点

提出BioTool数据集以增强大语言模型在生物医学领域的工具调用能力

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 生物医学信息学 工具调用 大语言模型 指令微调 基因组学 蛋白质组学 人工智能辅助科研

📋 核心要点

现有LLM在生物医学领域缺乏对专业工具的有效调用能力，且现有方法多局限于上下文学习，难以应对复杂任务。
论文构建了包含34种生物医学常用工具及7,040对高质量人工标注数据的BioTool数据集，专门用于LLM的指令微调。
实验证明，经BioTool微调的模型在工具调用准确性上超越了GPT-5.1，并能显著提升生物医学下游任务的回答质量。

📝 摘要（中文）

尽管大语言模型（LLM）在通用任务中表现出色，但在生物医学等高度专业化领域仍面临挑战。其核心瓶颈在于模型无法有效利用临床专家和研究人员日常工作中依赖的生物医学工具。现有研究多依赖上下文学习（In-context Learning）且工具集受限。为此，本文提出了BioTool，这是一个专为微调LLM设计的综合性生物医学工具调用数据集。该数据集涵盖了来自NCBI、Ensembl和UniProt数据库的34种常用工具，包含7,040对高质量、经人工验证的查询-API调用对，涉及变异、基因组学、蛋白质组学、进化生物学等领域。实验表明，在BioTool上微调40亿参数的LLM，其生物医学工具调用性能显著提升，超越了GPT-5.1等前沿商业模型。人工评估进一步证实，集成BioTool微调后的工具调用器能显著改善下游任务的回答质量。

🔬 方法详解

问题定义：论文旨在解决大语言模型在生物医学领域“工具调用能力不足”的问题。现有方法主要依赖上下文学习（ICL），受限于提示词长度和模型推理能力，难以处理大规模、多步骤的专业生物医学数据库查询任务。

核心思路：通过构建大规模、高质量的指令微调数据集，将生物医学工具的调用逻辑显式地注入模型参数中。这种方法将工具调用从“提示工程”转变为“模型内化能力”，从而实现更稳定、更精准的API调用。

技术框架：数据集构建流程包括：从NCBI、Ensembl、UniProt等权威数据库筛选34个高频工具；通过人工编写与验证，构建7,040个查询-API调用对；采用监督微调（SFT）策略，对LLM进行针对性训练，使其学会根据生物医学问题自动生成正确的API参数。

关键创新：首次构建了专门针对生物医学领域的工具调用数据集，打破了通用领域数据集在专业垂直领域的局限性，实现了模型对复杂生物医学API调用的深度理解与执行。

关键设计：数据集覆盖了变异分析、基因组学、蛋白质组学等多个子领域，确保了工具调用的多样性与覆盖度；通过人工验证确保了API调用参数的准确性，为模型提供了高质量的监督信号。

🖼️ 关键图片

📊 实验亮点

实验结果显示，仅使用40亿参数的模型在BioTool上微调后，其工具调用性能即超越了GPT-5.1等顶尖商业模型。人工专家评估表明，集成该工具调用器后，模型在处理复杂生物医学问题时的准确性和可靠性显著提升，证明了数据集在增强模型专业领域执行力方面的有效性。

🎯 应用场景

该研究可广泛应用于生物医学科研辅助、临床决策支持系统及药物研发流程。通过赋予LLM直接调用NCBI、UniProt等权威数据库的能力，模型能实时获取最新的生物学数据，为基因序列分析、蛋白质功能预测及疾病机制研究提供精准的自动化支持，显著提升科研效率。

📄 摘要（原文）

Despite the success of large language models (LLMs) on general-purpose tasks, their performance in highly specialized domains such as biomedicine remains unsatisfactory. A key limitation is the inability of LLMs to effectively leverage biomedical tools, which clinical experts and biomedical researchers rely on extensively in daily workflows. While recent general-domain tool-calling datasets have substantially improved the capabilities of LLM agents, existing efforts in the biomedical domain largely rely on in-context learning and restrict models to a small set of tools. To address this gap, we introduce BioTool, a comprehensive biomedical tool-calling dataset designed for fine-tuning LLMs. BioTool comprises 34 frequently used tools collected from the NCBI, Ensembl, and UniProt databases, along with 7,040 high-quality, human-verified query-API call pairs spanning variation, genomics, proteomics, evolution, and general biology. Fine-tuning a 4-billion-parameter LLM on BioTool yields substantial improvements in biomedical tool-calling performance, outperforming cutting-edge commercial LLMs such as GPT-5.1. Furthermore, human expert evaluations demonstrate that integrating a BioTool-fine-tuned tool caller significantly improves downstream answer quality compared to the same LLM without tool usage, highlighting the effectiveness of BioTool in enhancing the biomedical capabilities of LLMs. The full dataset and evaluation code are available at https://github.com/gxx27/BioTool

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理