Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases
Authors: Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie
Category: cs.CL
Published: 2026-06-03
💡 One-sentence summary
Proposes MedSP1000, an interactive benchmark for evaluating LLMs on dynamic clinical decision-making
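🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: large language models, clinical decision-making, standardized patients, dynamic evaluation, medical education, interactive benchmark, medical agents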
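📋 Key points
- Existing static, single-turn benchmarks cannot capture how a model behaves across a dynamic clinical encounter, so they under-evaluate clinical agents.
- The paper introduces MedSP1000, an interactive benchmark built from standardized patient (SP) cases, to evaluate clinical agents at the process level.
- The best-performing model completes only 60.4% of expert-defined rubric items, suggesting that current LLMs are not yet reliable enough for clinical use.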
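📝 Summary
Large language models (LLMs) are increasingly proposed as clinical agents, but static, single-turn benchmarks cannot evaluate how a model performs in dynamic clinical decision-making. To address this, the paper introduces MedSP1000, an interactive benchmark derived from standardized patients that includes 1,638 SP cases and 24,602 trajectory-level, peer-reviewed rubrics. MedSP1000 converts SP teaching cases into executable scenarios in which a clinical agent interacts in closed loop with a patient agent and an environment controller. The study finds that performance on static benchmarks does not reliably translate to these interactive scenarios, suggesting that current LLMs are not yet reliable enough to be safely integrated into clinical practice.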
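🔬 Method details
Problem definition: Existing evaluations of LLMs fall short for dynamic clinical decision-making, because static benchmarks do not reflect how a model gathers information, plans treatment, and adapts management across successive patient states.
Core idea: MedSP1000 builds an interactive benchmark from standardized patient (SP) cases, simulating the clinical decision-making process so that a model's abilities can be assessed across the whole encounter rather than in a single turn.
Technical framework: MedSP1000 comprises 1,638 SP cases with 24,602 trajectory-level, peer-reviewed rubrics. In each simulation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against the expert criteria specified in the original case materials.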
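The closed-loop interaction described above can be pictured with a minimal sketch. This is not the paper's implementation: the function `run_encounter`, the `ORDER:` and `END` prefixes, and the three toy roles are all hypothetical names chosen for illustration, and a real run would replace the toy functions with LLM-backed agents.

```python
from typing import Callable, List, Tuple

# Hypothetical types: the summary does not specify the benchmark's interfaces.
Turn = Tuple[str, str]  # (speaker, message)


def run_encounter(
    clinician: Callable[[List[Turn]], str],
    patient: Callable[[str], str],
    controller: Callable[[str], str],
    max_turns: int = 10,
) -> List[Turn]:
    """Run one closed-loop encounter and return the full transcript."""
    transcript: List[Turn] = []
    for _ in range(max_turns):
        action = clinician(transcript)
        transcript.append(("clinician", action))
        if action.startswith("END"):
            break  # the clinician closes the encounter
        if action.startswith("ORDER:"):
            # Orders (tests, exams) are answered by the environment controller.
            transcript.append(("environment", controller(action)))
        else:
            # Anything else is treated as dialogue with the patient.
            transcript.append(("patient", patient(action)))
    return transcript


# Toy stand-ins for the three roles (assumed, for illustration only).
def scripted_clinician(transcript: List[Turn]) -> str:
    script = ["What brings you in today?", "ORDER: ECG", "END: refer to cardiology"]
    turns_taken = sum(1 for speaker, _ in transcript if speaker == "clinician")
    return script[min(turns_taken, len(script) - 1)]


def toy_patient(message: str) -> str:
    return "I have had chest pain since this morning."


def toy_controller(order: str) -> str:
    name = order[len("ORDER:"):].strip()
    return "Result for " + name + ": not available in this toy case"


transcript = run_encounter(scripted_clinician, toy_patient, toy_controller)
for speaker, message in transcript:
    print(f"{speaker}: {message}")
```

Returning the whole transcript, rather than only the final answer, is what makes trajectory-level scoring possible: a grader can check what the agent asked and ordered along the way.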
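Key innovation: MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios, enabling dynamic evaluation that reveals clinically relevant failure modes that static, single-turn benchmarks miss.
Key design: Each scenario is defined by an SP case script, a clinical environment context, and a human-validated structured rubric, which keeps scoring objective and consistent across runs.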
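As a rough illustration of how the three components of a scenario might be bundled, the sketch below defines a small record type. The class names, field names, and the `validate_case` helper are assumptions made for this example; the summary does not describe the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RubricItem:
    item_id: str
    criterion: str  # one expert-defined behaviour to look for in the trajectory


@dataclass(frozen=True)
class SPCase:
    # Hypothetical layout: one field per component named in the text.
    case_id: str
    script: Dict[str, str]       # SP case script: topic -> what the patient says
    environment: Dict[str, str]  # clinical environment context: order -> result
    rubric: List[RubricItem]     # human-validated structured rubric


def validate_case(case: SPCase) -> List[str]:
    """Return a list of structural problems; an empty list means the case is usable."""
    problems = []
    if not case.script:
        problems.append("empty SP script")
    if not case.rubric:
        problems.append("no rubric items")
    ids = [item.item_id for item in case.rubric]
    if len(ids) != len(set(ids)):
        problems.append("duplicate rubric item ids")
    return problems


example = SPCase(
    case_id="demo-001",
    script={"chief complaint": "Chest pain since this morning."},
    environment={"ECG": "Placeholder result text."},
    rubric=[
        RubricItem("r1", "Asks about onset of the pain"),
        RubricItem("r2", "Orders an ECG"),
    ],
)
print(validate_case(example))
```

Keeping the rubric as explicit, individually identified items is one way to make scoring repeatable, since every run of the same case is checked against the same list.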
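📊 Experimental highlights
The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, while the strongest medically specialized model reaches 40.0%. Increasing test-time compute produces no measurable gain, indicating that current LLMs still need substantial improvement before use in clinical practice.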
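To make the "share of rubric items completed" metric concrete, here is a small sketch of how such a percentage could be computed from per-item pass/fail judgments. The summary does not say whether the reported figures pool all items or average per case, so both variants are shown; the function names and the toy data are assumptions, not results from the paper.

```python
from typing import Dict, List


def completion_rate(met: List[bool]) -> float:
    """Fraction of rubric items judged as completed."""
    if not met:
        raise ValueError("at least one rubric item is required")
    return sum(met) / len(met)


def micro_average(results: Dict[str, List[bool]]) -> float:
    """Pool every rubric item across cases, then take one overall fraction."""
    pooled = [flag for flags in results.values() for flag in flags]
    return completion_rate(pooled)


def macro_average(results: Dict[str, List[bool]]) -> float:
    """Compute the fraction per case, then average the per-case fractions."""
    rates = [completion_rate(flags) for flags in results.values()]
    return sum(rates) / len(rates)


# Made-up judgments for two hypothetical cases (not data from the paper).
results = {
    "case-a": [True, True, True, False],                  # 3 of 4 items met
    "case-b": [True, False, False, False, False, False],  # 1 of 6 items met
}
print(f"micro: {micro_average(results):.3f}")  # 4 of 10 pooled items
print(f"macro: {macro_average(results):.3f}")  # mean of 3/4 and 1/6
```

The two averages differ whenever cases have different numbers of rubric items, so a reported percentage is only comparable across papers when the aggregation is stated.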
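🎯 Application scenarios
Potential applications include medical education, clinical training, and clinical decision support systems. By providing a more realistic evaluation environment, MedSP1000 can help in developing more reliable clinical agents and in improving the quality and safety of care.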
📄 Abstract (original)
Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubric. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.