Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology

Authors: Dou Liu, Ying Long, Sophia Zuoqiu, Tian Tang, Rong Yin

Categories: cs.CL, cs.AI

Published: 2025-03-31

Comments: Accepted by IISE 2025 annual conference

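💡 One-Sentence Summary

Evaluates the feasibility and accuracy of large language models for medical history-taking in obstetrics and gynecology.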

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, medical history-taking, infertility, ChatGPT-4o, ChatGPT-4o-mini, medical AI, natural language processing, diagnostic support

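📋 Key Points

Conventional infertility history-taking is time-consuming, reduces clinic efficiency, and depends on physician experience, which risks missing information.
An AI conversational system built on ChatGPT-4o and ChatGPT-4o-mini simulates physician-patient interactions and collects the medical history automatically.
In the experiments, ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction (F1 score: 0.9258 vs. 0.9029) and history completeness (97.58% vs. 77.11%); differential diagnosis accuracy did not differ significantly between the two models (p > 0.05).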

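📝 Abstract (translated from Chinese)

Effective physician-patient communication before diagnosis is critical, particularly in complex and sensitive areas such as infertility, but it is time-consuming and makes clinic workflows inefficient. Recent advances in large language models (LLMs) offer a potential way to automate conversational history-taking and improve diagnostic accuracy. This study evaluates the feasibility and performance of LLMs on these tasks for infertility cases. The authors developed an AI-driven conversational system that simulates physician-patient interactions using ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed with F1 score, differential diagnosis (DD) accuracy, and infertility type judgment (ITJ) accuracy. ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and produced more complete histories (97.58% vs. 77.11%), suggesting that it is more effective at extracting detailed patient information, which is critical for diagnostic accuracy. ChatGPT-4o, by contrast, scored slightly higher on differential diagnosis accuracy (2.0524 vs. 2.0048, p > 0.05). ITJ accuracy was higher for ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach's α = 0.562), indicating variability in classification reliability. Both models showed strong feasibility for automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. Future work should prioritize expert validation of accuracy and reliability in clinical settings, fine-tuning of the AI models, and larger datasets covering a wider mix of infertility cases.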

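🔬 Method Details

Problem definition: The paper addresses the inefficiency and time cost of conventional history-taking in infertility cases. Existing practice is manual and constrained by the physician's experience and available time, which can leave the history incomplete and affect diagnostic accuracy.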

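Core idea: The paper uses the conversational ability of large language models (LLMs) to build an AI-driven dialogue system that simulates physician-patient interactions and collects the medical history automatically and efficiently. It then compares different LLMs to identify which is better suited to the task.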

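Technical framework: The overall pipeline has four steps: 1) collect real-world infertility case data; 2) build a dialogue system with ChatGPT-4o and ChatGPT-4o-mini that simulates physician-patient conversations and generates medical histories automatically; 3) evaluate model performance with F1 score, differential diagnosis accuracy, and infertility type judgment accuracy; 4) compare the models' strengths and weaknesses and analyze the causes.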
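Step 2 of the pipeline can be pictured as a turn-taking loop between a physician-side agent and a simulated patient. The sketch below is illustrative only: `FIELDS`, `scripted_physician`, `make_patient`, and `take_history` are hypothetical names, and the two agents are plain Python stand-ins, not the ChatGPT-4o or ChatGPT-4o-mini calls used in the study. In a real system each stand-in would presumably wrap a model call.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical history template; the paper's actual field list is not given here.
FIELDS: List[str] = ["age", "infertility_duration", "menstrual_cycle", "prior_pregnancies"]


def scripted_physician(collected: Dict[str, str]) -> Optional[str]:
    """Stand-in for the physician-side model: pick the next field to ask about."""
    for field in FIELDS:
        if field not in collected:
            return field
    return None  # nothing left to ask


def make_patient(case: Dict[str, str]) -> Callable[[str], str]:
    """Stand-in for the simulated patient: answer from a stored case record."""
    def answer(field: str) -> str:
        return case.get(field, "unknown")
    return answer


def take_history(
    physician: Callable[[Dict[str, str]], Optional[str]],
    patient: Callable[[str], str],
    max_turns: int = 20,
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Alternate question and answer until the physician stops or max_turns is reached."""
    collected: Dict[str, str] = {}
    transcript: List[Tuple[str, str]] = []
    for _ in range(max_turns):
        field = physician(collected)
        if field is None:
            break
        reply = patient(field)
        transcript.append((field, reply))
        collected[field] = reply
    return collected, transcript


# Made-up case record for illustration, not data from the study.
case = {"age": "32", "infertility_duration": "2 years", "menstrual_cycle": "irregular"}
history, transcript = take_history(scripted_physician, make_patient(case))
```

Passing the two agents in as callables keeps the loop independent of any particular model, so the same harness could be run once per model under comparison.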

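Key innovation: The paper applies large language models to infertility history-taking and compares how different LLMs perform on this task. It finds that ChatGPT-4o-mini does better on information extraction and history completeness, which differs from what evaluations of general-purpose LLM performance in prior work would suggest.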

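Key design: The main design choices are: 1) using real-world infertility case data, which keeps the study practically relevant; 2) evaluating model performance along several dimensions: F1 score, differential diagnosis accuracy, and infertility type judgment accuracy; 3) comparing ChatGPT-4o and ChatGPT-4o-mini and analyzing their differences to inform later model selection. Cronbach's α is used to assess the consistency of the infertility type judgments.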
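Two of the metrics named above have standard textbook definitions that are easy to sketch. The code below assumes, for illustration, that F1 is computed over matching (field, value) pairs and that Cronbach's α is computed over repeated runs per case; the summary does not state the paper's exact matching rule or scoring layout, so `extraction_f1` and `cronbach_alpha` are hypothetical helpers, not the authors' implementation.

```python
from statistics import variance
from typing import Dict, List


def extraction_f1(predicted: Dict[str, str], reference: Dict[str, str]) -> float:
    """F1 over exact (field, value) matches between an extracted and a reference history."""
    pred_items = set(predicted.items())
    ref_items = set(reference.items())
    true_positives = len(pred_items & ref_items)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_items)
    recall = true_positives / len(ref_items)
    return 2 * precision * recall / (precision + recall)


def cronbach_alpha(scores: List[List[float]]) -> float:
    """Cronbach's alpha, where scores[i][j] is the score of case i on repeated run j.

    alpha = k / (k - 1) * (1 - sum of per-run variances / variance of per-case totals)
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("need at least two runs per case")
    run_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total score variance is zero; alpha is undefined")
    return k / (k - 1) * (1 - sum(run_variances) / total_variance)


# Made-up values for illustration, not data from the study.
predicted = {"age": "32", "cycle": "regular", "bmi": "22"}
reference = {"age": "32", "cycle": "irregular", "bmi": "22", "smoking": "no"}
f1 = extraction_f1(predicted, reference)  # 2 matches: precision 2/3, recall 2/4

consistent_runs = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
alpha = cronbach_alpha(consistent_runs)  # identical runs give alpha = 1
```

An α well below 1, such as the 0.562 reported for ChatGPT-4o-mini, means the repeated judgments for the same case disagree with each other fairly often.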

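📊 Experimental Highlights

ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and in history completeness (97.58% vs. 77.11%). ChatGPT-4o scored slightly higher on differential diagnosis accuracy (2.0524 vs. 2.0048), but the difference was not statistically significant (p > 0.05). On the history-taking task itself, the extraction and completeness results favor ChatGPT-4o-mini.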
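The effect size d and the completeness percentage can be illustrated with a short sketch. Two assumptions are made here and are not confirmed by the summary: that d is a paired-samples Cohen's d (mean difference divided by the standard deviation of the differences), and that completeness is the share of required fields with a recorded answer. `paired_cohens_d` and `completeness` are hypothetical helper names.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def paired_cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired-samples effect size: mean of the differences over their standard deviation."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance; d is undefined")
    return mean(diffs) / spread


def completeness(history: Dict[str, str], required_fields: List[str]) -> float:
    """Fraction of required fields that have a non-empty answer in the history."""
    filled = sum(1 for field in required_fields if history.get(field, "").strip())
    return filled / len(required_fields)


# Made-up values for illustration, not data from the study.
d = paired_cohens_d([3, 5, 7], [1, 2, 3])  # differences are 2, 3, 4
share = completeness(
    {"age": "32", "cycle": "irregular", "bmi": ""},
    ["age", "cycle", "bmi", "smoking"],
)  # 2 of 4 required fields are filled
```

By the usual rule of thumb, a d near 0.2 is a small effect, so the reported d = 0.244 indicates a statistically detectable but modest F1 gap between the two models.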

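🎯 Application Scenarios

The results can support physicians in infertility history-taking, improving clinical efficiency and reducing physician workload. The approach could later be extended to other medical fields for more automated history-taking and diagnostic support, with the aim of improving service quality and lowering costs. It also has potential uses in telemedicine and patient self-service.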

📄 Abstract (original)

Effective physician-patient communications in pre-diagnostic environments, and most specifically in complex and sensitive medical areas such as infertility, are critical but consume a lot of time and, therefore, cause clinic workflows to become inefficient. Recent advancements in Large Language Models (LLMs) offer a potential solution for automating conversational medical history-taking and improving diagnostic accuracy. This study evaluates the feasibility and performance of LLMs in those tasks for infertility cases. An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed using F1 score, Differential Diagnosis (DDs) Accuracy, and Accuracy of Infertility Type Judgment (ITJ). ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and demonstrated higher completeness in medical history-taking (97.58% vs. 77.11%), suggesting that ChatGPT-4o-mini is more effective in extracting detailed patient information, which is critical for improving diagnostic accuracy. In contrast, ChatGPT-4o performed slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048, p > 0.05). ITJ accuracy was higher in ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach's $α$ = 0.562), suggesting variability in classification reliability. Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. In future studies, expert validation for accuracy and dependability in a clinical setting, AI model fine-tuning, and larger datasets with a mix of cases of infertility have to be prioritized.
