PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

Authors: Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca

Categories: cs.CL, cs.LG

Published: 2025-09-15

Comments: https://github.com/rodrigo-carrillo/PeruMedQA

💡 One-sentence summary

PeruMedQA: building a dataset of Peruvian medical exam questions and using it to evaluate large language models

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: medical question answering, large language models, Spanish, dataset construction, parameter-efficient fine-tuning

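📋 Key points

Existing medical LLMs perform well on English-language medical questions, but they have not been properly evaluated on medical questions in Spanish and from Latin America.
The paper builds the PeruMedQA dataset and uses parameter-efficient fine-tuning to adapt an LLM to Peruvian medical exam questions.
In the experiments, medgemma-27b-text-it and the fine-tuned medgemma-4b-it perform best on PeruMedQA.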

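📝 Abstract (translated from the Chinese)

Background: Medical large language models (LLMs) perform remarkably well on medical exam questions. However, the extent to which this performance transfers to medical questions in Spanish and from Latin American countries remains unexplored. This matters because LLM-based medical applications are gaining traction in Latin America. Aims: to build a dataset of questions from the medical exams taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; and to evaluate and compare the accuracy of vanilla LLMs and the fine-tuned LLM. Methods: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. We used parameter-efficient fine-tuning (PEFT) with low-rank adaptation (LoRA) to fine-tune medgemma-4b-it on all questions except those from 2025, which served as the test set. Results: medgemma-27b-text-it outperformed all other models, with a proportion of correct answers exceeding 90% in several instances. LLMs with fewer than 10 billion parameters scored below 60% correct, and below 50% on some exams. The fine-tuned medgemma-4b-it beat every LLM with fewer than 10 billion parameters and rivaled a 70-billion-parameter LLM on various exams. Conclusions: For medical AI applications and research that require knowledge bases from Spanish-speaking countries, or from countries with epidemiological profiles similar to Peru's, interested parties should use medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.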

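🔬 Method details

Problem definition: The paper addresses how to evaluate medical LLMs on medical exam questions written in Spanish and drawn from Latin America. Existing work concentrates on English-language datasets and rarely assesses how well models adapt to other languages and regions, so their practical performance in those regions is unknown. The two gaps are the lack of a high-quality Spanish-language medical exam dataset and the lack of evaluation and fine-tuning methods targeted at such a dataset.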

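Core idea: The paper builds a high-quality dataset of Peruvian medical exam questions (PeruMedQA) and uses it to both evaluate and fine-tune existing medical LLMs. This shows how LLMs actually perform on Spanish-language medical questions and improves their usefulness in that setting. The authors chose parameter-efficient fine-tuning to keep the computational cost of fine-tuning low.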

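Technical framework: The pipeline has five steps: 1) Dataset construction: collect and curate Peruvian medical exam questions to build PeruMedQA. 2) Model selection: choose several medical LLMs to evaluate, including medgemma-4b-it and medgemma-27b-text-it. 3) Zero-shot evaluation: evaluate the LLMs with zero-shot task-specific prompts. 4) Fine-tuning: fine-tune medgemma-4b-it with parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA). 5) Performance comparison: compare vanilla and fine-tuned LLMs on PeruMedQA.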
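The evaluation part of this pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the record fields (`year`, `question`, `options`, `answer`), the English prompt wording, and the `answer_fn` callable standing in for a model are all assumptions, and the toy records are placeholders rather than real PeruMedQA items. It shows the held-out-year split (2025 as the test set), a zero-shot prompt with no worked examples, and accuracy as the proportion of correct answers.

```python
import re


def split_by_year(records, test_year=2025):
    """Hold out one exam year as the test set; the rest is available for fine-tuning."""
    train = [r for r in records if r["year"] != test_year]
    test = [r for r in records if r["year"] == test_year]
    return train, test


def build_prompt(record):
    """Format one multiple-choice question as a zero-shot prompt (no worked examples)."""
    lines = [
        "Answer the following medical exam question.",
        "Reply with the letter of the correct option only.",
        "",
        record["question"],
    ]
    for letter in sorted(record["options"]):
        lines.append(f"{letter}) {record['options'][letter]}")
    lines.append("Answer:")
    return "\n".join(lines)


# A reply counts only if it starts with a single option letter, e.g. "B", "b)", "(C)".
_LETTER = re.compile(r"\s*\(?([A-Za-z])\)?(?:[\s.:)]|$)")


def parse_choice(reply, valid_letters):
    """Return the chosen option letter, or None if the reply is not a valid choice."""
    match = _LETTER.match(reply)
    if not match:
        return None
    letter = match.group(1).upper()
    return letter if letter in valid_letters else None


def accuracy(records, answer_fn):
    """Proportion of records for which answer_fn(prompt) selects the keyed answer."""
    if not records:
        raise ValueError("no records to score")
    correct = 0
    for r in records:
        choice = parse_choice(answer_fn(build_prompt(r)), set(r["options"]))
        correct += choice == r["answer"]
    return correct / len(records)


# Placeholder records (hypothetical layout, not real exam content).
TOY = [
    {"year": 2023, "question": "Placeholder question 1",
     "options": {"A": "option one", "B": "option two"}, "answer": "A"},
    {"year": 2025, "question": "Placeholder question 2",
     "options": {"A": "option one", "B": "option two"}, "answer": "A"},
    {"year": 2025, "question": "Placeholder question 3",
     "options": {"A": "option one", "B": "option two", "C": "option three"}, "answer": "C"},
]

train, test = split_by_year(TOY)
# A stand-in "model" that always answers "A"; replace with a real LLM call.
print(len(train), len(test), accuracy(test, lambda prompt: "A"))
```

Passing the model as a plain callable keeps the scoring logic independent of any particular inference library, so the same function can score a vanilla model and a fine-tuned one on the identical test split.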

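Key contributions: 1) The paper builds PeruMedQA, filling a gap in Spanish-language medical exam datasets. 2) It uses parameter-efficient fine-tuning, which lowers the computational cost of fine-tuning. 3) It evaluates several medical LLMs on PeruMedQA, giving later work a point of reference.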

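Key design choices: 1) Dataset construction: ensure the quality and diversity of the dataset and cover multiple medical domains. 2) Prompt engineering: design effective zero-shot task-specific prompts to improve answer accuracy. 3) Fine-tuning strategy: choose a suitable parameter-efficient fine-tuning method and settings. The paper uses Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning.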
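As general background on why LoRA is parameter-efficient (this summary does not report the rank, the scaling factor, or which layers were adapted, so the symbols below are generic rather than the paper's settings): LoRA freezes a pretrained weight matrix $W_0$ and trains only a low-rank update $BA$ added to it.

```latex
% Frozen pretrained weight: W_0 \in \mathbb{R}^{d \times k}
% Trainable factors:        B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
\begin{align}
  h &= W_0 x + \frac{\alpha}{r}\, B A x, \\
  \text{trainable parameters per adapted matrix} &= r\,(d + k) \quad \text{instead of } d\,k .
\end{align}
```

For example, with $d = k = 4096$ and an assumed rank $r = 8$, the update has $8 \times (4096 + 4096) = 65{,}536$ trainable parameters, against $4096^2 \approx 16.8$ million in the full matrix.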

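📊 Experimental highlights

medgemma-27b-text-it performed best on PeruMedQA, with a proportion of correct answers exceeding 90% in several instances. The fine-tuned medgemma-4b-it also improved markedly and rivaled an LLM with 70 billion parameters. LLMs with fewer than 10 billion parameters scored below 60% correct, and below 50% on some exams.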
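Because the results are reported per exam as a proportion of correct answers, a small aggregation step is needed to go from graded answers to figures like these. The sketch below is one way to do it, assuming graded results arrive as `(exam, is_correct)` pairs; the exam names and outcomes are placeholders, not values from the paper.

```python
from collections import defaultdict


def per_exam_accuracy(results):
    """Map each exam name to its proportion of correct answers."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for exam, is_correct in results:
        totals[exam] += 1
        correct[exam] += bool(is_correct)
    return {exam: correct[exam] / totals[exam] for exam in totals}


def exams_below(accuracies, threshold):
    """List exams whose proportion correct falls strictly below the threshold."""
    return sorted(exam for exam, acc in accuracies.items() if acc < threshold)


# Placeholder graded answers (hypothetical exam names, made-up outcomes).
graded = [
    ("exam_a", True),
    ("exam_a", False),
    ("exam_b", True),
]
scores = per_exam_accuracy(graded)
print(scores, exams_below(scores, 0.60))
```

Reporting per exam rather than pooling all questions keeps a weak result on one specialty from being hidden by strong results elsewhere.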

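🎯 Applications

The results can inform medical AI applications for Spanish-speaking regions, such as diagnostic support, medical question answering, and medical education. PeruMedQA can serve as a benchmark for evaluating and improving medical LLMs on Spanish-language medical content. The findings support the adoption and development of medical AI in Latin America.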

📄 Abstract (original)

BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; to evaluate and compare the performance in terms of accuracy between vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. We employed parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it utilizing all questions except those from 2025 (test set). RESULTS: medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances. LLMs with <10 billion parameters exhibited <60% of correct answers, while some exams yielded results <50%. The fine-tuned version of medgemma-4b-it emerged victorious against all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI applications and research that require knowledge bases from Spanish-speaking countries and those exhibiting similar epidemiological profiles to Peru's, interested parties should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.
