Distilling Large Language Models for Matching Patients to Clinical Trials

Authors: Mauro Nievas, Aditya Basu, Yanshan Wang, Hrituraj Singh

Categories: cs.AI, cs.IR

Published: 2023-12-15

💡 One-Sentence Takeaway

Distill a large language model to match patients to clinical trials efficiently.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, clinical trial matching, knowledge distillation, synthetic data, open-source models, healthcare, LLAMA, GPT-4

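📋 Key Points

Existing patient-trial matching methods rely on manual work or complex feature engineering, which makes them inefficient and hard to generalize; applying large language models directly is a promising alternative.
This study uses GPT-4 to generate synthetic data and fine-tunes open-source LLMs on it, bringing them to performance comparable with closed-source models on patient-trial matching.
Experiments show that the fine-tuned open-source LLM (Trial-LLAMA) performs well on patient-trial matching, offering a practical option for real-world healthcare applications.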

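📝 Abstract (Translated from Chinese)

The success of large language models (LLMs) has paved the way for their use in healthcare. In patient-trial matching in particular, LLMs show great promise by assessing whether a patient satisfies a clinical trial's inclusion and exclusion criteria. GPT-3.5 performs strongly on this task, but its closed-source nature raises cost, privacy, and reproducibility concerns. This study systematically evaluates closed-source (GPT-3.5, GPT-4) and open-source (LLAMA 7B, 13B, 70B) LLMs on patient-trial matching. Using a multifaceted evaluation framework that combines automated evaluation, human evaluation, and detailed error analysis, the authors find that open-source LLMs fine-tuned on a synthetic dataset generated with GPT-4 can match the performance of closed-source LLMs. This opens the door to deploying them in real-world healthcare applications. To support further research and applications, the authors publicly release the annotated evaluation dataset and the fine-tuned LLM, Trial-LLAMA.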

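🔬 Method Details

Problem definition: The paper addresses the problem of matching patients to clinical trials. Existing approaches, such as hand-written rules or traditional machine learning models, require extensive manual feature engineering, generalize poorly, and struggle with complex trial criteria. Closed-source LLMs perform well but raise concerns about cost, data privacy, and reproducibility.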
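To make the task concrete, the following is a minimal sketch of how a trial and per-criterion judgments might be represented. The `Trial` class, the `is_eligible` helper, the trial identifier, and the aggregation rule (all inclusion criteria met, no exclusion criterion met) are illustrative assumptions, not the paper's data format or decision procedure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Trial:
    """Hypothetical container for a trial's eligibility criteria."""
    trial_id: str
    inclusion: List[str] = field(default_factory=list)
    exclusion: List[str] = field(default_factory=list)


def is_eligible(trial: Trial, met: Dict[str, bool]) -> bool:
    """Assumed rule: every inclusion criterion is met and no exclusion criterion is met.

    `met` maps criterion text to a per-criterion judgment (for example, one
    produced by an LLM). A criterion missing from `met` is treated as not met.
    """
    all_included = all(met.get(c, False) for c in trial.inclusion)
    none_excluded = not any(met.get(c, False) for c in trial.exclusion)
    return all_included and none_excluded


# Made-up example data, for illustration only.
trial = Trial(
    trial_id="HYPOTHETICAL-001",
    inclusion=["Age 18 or older", "Diagnosed with type 2 diabetes"],
    exclusion=["Currently pregnant"],
)
judgments = {
    "Age 18 or older": True,
    "Diagnosed with type 2 diabetes": True,
    "Currently pregnant": False,
}
eligible = is_eligible(trial, judgments)
```

The hard part of the task is producing the per-criterion judgments from free-text patient summaries; the aggregation step shown here is the easy part.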

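Core idea: Use the generative capability of a closed-source LLM (GPT-4) to produce high-quality synthetic data, then fine-tune open-source LLMs on that data. This keeps the benefits of a strong LLM while avoiding the drawbacks of closed-source models. In effect, the closed-source model's knowledge is distilled into the open-source model.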
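The data-generation half of this idea can be sketched as follows. This is a rough illustration, not the authors' pipeline: the prompt wording, the `build_prompt` and `make_training_records` helpers, and the `fake_teacher` stand-in are all hypothetical. In practice the `teacher` callable would wrap a call to the closed-source model.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt template; the paper's actual prompts are not given here.
PROMPT_TEMPLATE = (
    "Patient summary:\n{patient}\n\n"
    "Clinical trial criteria:\n{criteria}\n\n"
    "For each criterion, state whether the patient meets it and explain briefly."
)


def build_prompt(patient: str, criteria: List[str]) -> str:
    """Format one patient summary and a numbered criteria list into a prompt."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(criteria, start=1))
    return PROMPT_TEMPLATE.format(patient=patient, criteria=numbered)


def make_training_records(
    pairs: List[Tuple[str, List[str]]],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Ask the teacher model for an answer and keep (prompt, completion) pairs."""
    records = []
    for patient, criteria in pairs:
        prompt = build_prompt(patient, criteria)
        records.append({"prompt": prompt, "completion": teacher(prompt)})
    return records


def fake_teacher(prompt: str) -> str:
    """Stand-in for the teacher LLM so the sketch runs without network access."""
    return "1. Met: the summary states the patient is 54 years old."


# Made-up example input.
pairs = [("54-year-old with type 2 diabetes.", ["Age 18 or older"])]
records = make_training_records(pairs, fake_teacher)

# One JSON object per line, a common format for fine-tuning data.
jsonl = "\n".join(json.dumps(r) for r in records)
```

Passing the teacher in as a callable keeps the record-building logic independent of any particular API client, which also makes it easy to test offline.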

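Technical framework: The overall pipeline has four stages: 1) use GPT-4 to generate a synthetic patient-trial matching dataset; 2) fine-tune open-source LLMs (LLAMA 7B, 13B, 70B) on the synthetic dataset; 3) evaluate the fine-tuned open-source LLMs on a real dataset and compare them with closed-source LLMs (GPT-3.5, GPT-4); 4) perform human evaluation and error analysis to understand model behavior in depth.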

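Key innovation: The main contribution is generating synthetic data with GPT-4 and using it to fine-tune open-source LLMs, so that they reach performance comparable to closed-source LLMs on patient-trial matching. This addresses the weak performance of open-source LLMs when training data is scarce and avoids the cost and privacy problems of closed-source LLMs.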

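Key design: The synthetic data generation process is central. The paper may have used specific prompt engineering techniques to steer GPT-4 toward high-quality, diverse patient-trial matching data. Fine-tuning may have involved particular hyperparameters such as learning rate and batch size, along with a suitable loss function (for example, cross-entropy loss). The network architecture is that of the LLAMA family, without major modifications.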
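As a rough illustration of the cross-entropy loss mentioned above, the sketch below computes a token-level loss in plain Python. Masking the prompt tokens so that only completion tokens contribute is a common fine-tuning practice and an assumption here, not something the summary confirms about the paper. The function names and toy numbers are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_entropy(
    logits: List[List[float]],
    targets: List[int],
    loss_mask: List[int],
) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    total, count = 0.0, 0
    for step_logits, target, keep in zip(logits, targets, loss_mask):
        if keep:
            total += -math.log(softmax(step_logits)[target])
            count += 1
    if count == 0:
        raise ValueError("loss_mask selects no positions")
    return total / count


# Toy example with a vocabulary of 3 tokens and a sequence of 3 positions.
# Position 0 stands for a prompt token and is masked out (assumption).
logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [1, 2, 0]
loss_mask = [0, 1, 1]
loss = masked_cross_entropy(logits, targets, loss_mask)
```

With uniform logits at the two unmasked positions, each target has probability 1/3, so the loss equals ln 3. A real fine-tuning run would compute the same quantity on tensors with a deep learning framework.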

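📊 Experimental Highlights

Experiments show that open-source LLMs fine-tuned on the GPT-4-generated synthetic dataset (Trial-LLAMA) can match the performance of closed-source LLMs (GPT-3.5, GPT-4). On patient-trial matching, open-source LLMs can therefore reach accuracy comparable to closed-source LLMs while avoiding their cost and privacy problems. Specific performance figures are not reported in this summary, but the paper emphasizes performance parity.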
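One way to compare models in an automated evaluation is to score their binary eligibility predictions against gold labels. The sketch below assumes precision, recall, and F1 as metrics; the summary does not say which metrics the paper used, and the labels, predictions, and model names are made up and are not the paper's results.

```python
from typing import Dict, List


def precision_recall_f1(gold: List[int], pred: List[int]) -> Dict[str, float]:
    """Binary precision, recall, and F1 with 1 as the positive (eligible) class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up gold labels and predictions for two hypothetical models.
gold = [1, 0, 1, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0],
    "model_b": [1, 1, 1, 1, 0],
}
scores = {name: precision_recall_f1(gold, pred) for name, pred in predictions.items()}
```

Scoring every model with the same function on the same gold labels is what makes a parity claim checkable; human evaluation and error analysis then cover what aggregate metrics miss.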

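🎯 Application Scenarios

The results can be applied to clinical trial recruitment and to recommending treatment options for patients. Automatically matching patients with suitable clinical trials could speed up drug development and improve treatment efficiency and survival for patients. Using open-source LLMs lowers costs for healthcare institutions and helps protect the privacy of patient data, which gives the work significant social value.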

📄 Abstract (Original)

The recent success of large language models (LLMs) has paved the way for their adoption in the high-stakes domain of healthcare. Specifically, the application of LLMs in patient-trial matching, which involves assessing patient eligibility against clinical trials' nuanced inclusion and exclusion criteria, has shown promise. Recent research has shown that GPT-3.5, a widely recognized LLM developed by OpenAI, can outperform existing methods with minimal 'variable engineering' by simply comparing clinical trial information against patient summaries. However, there are significant challenges associated with using closed-source proprietary LLMs like GPT-3.5 in practical healthcare applications, such as cost, privacy and reproducibility concerns. To address these issues, this study presents the first systematic examination of the efficacy of both proprietary (GPT-3.5 and GPT-4) and open-source LLMs (LLAMA 7B, 13B, and 70B) for the task of patient-trial matching. Employing a multifaceted evaluation framework, we conducted extensive automated and human-centric assessments coupled with a detailed error analysis for each model. To enhance the adaptability of open-source LLMs, we have created a specialized synthetic dataset utilizing GPT-4, enabling effective fine-tuning under constrained data conditions. Our findings reveal that open-source LLMs, when fine-tuned on this limited and synthetic dataset, demonstrate performance parity with their proprietary counterparts. This presents a massive opportunity for their deployment in real-world healthcare applications. To foster further research and applications in this field, we release both the annotated evaluation dataset and the fine-tuned LLM -- Trial-LLAMA -- for public use.
