Large Language Models Lack Temporal Awareness of Medical Knowledge

Authors: Zihan Guan, Qiao Jin, Guangzhi Xiong, Fangyuan Chen, Mengxuan Hu, Qingyu Chen, Yifan Peng, Zhiyong Lu, Anil Vullikanti

Categories: cs.LG, cs.CL

Published: 2026-05-13

Comments: 35 pages, 18 figures

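💡 One-Sentence Summary

TempoMed-Bench reveals that large language models lack temporal awareness of medical knowledge.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, medical knowledge, temporal awareness, knowledge evolution, evaluation benchmark, TempoMed-Bench, clinical decision support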

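📋 Key Points

Existing evaluations of medical knowledge ignore the fact that medical knowledge changes over time, so their results are incomplete.
TempoMed-Bench measures temporal awareness by testing how well a model knows the medical knowledge that held at different points in time.
Experiments show that LLMs have weak temporal awareness: they recall historical knowledge poorly, and their predictions are unstable.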

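📝 Abstract (Translated)

Existing methods for evaluating the medical knowledge of large language models (LLMs) rely mainly on static, examination-style benchmarks. Medical knowledge, however, is inherently dynamic and keeps evolving as new evidence emerges and new treatments are approved. Evaluating medical knowledge without temporal context may therefore give an incomplete picture of whether LLMs can reason accurately about the medical knowledge of a specific point in time. In addition, most medical data are historical, so a model must not only recall the correct knowledge but also know when that knowledge was correct. To close this gap, the authors built TempoMed-Bench, the first benchmark that evaluates the temporal awareness of LLMs in the medical domain through evolving guideline knowledge. The evaluation on TempoMed-Bench shows that LLMs lack temporal awareness of medical knowledge. The main findings are: (1) model performance on up-to-date medical knowledge declines gradually and linearly over time rather than dropping sharply at a knowledge cutoff, suggesting that parametric medical knowledge is not strictly bounded by the knowledge cutoff; (2) LLMs consistently find it harder to recall outdated historical medical knowledge than up-to-date recommendations: accuracy on historical knowledge is only 25.37%-53.89% of accuracy on up-to-date knowledge, which points to possible knowledge forgetting during training; (3) LLMs often behave inconsistently over time, with predictions fluctuating irregularly between neighboring years. The authors also show that the temporal awareness problem remains hard to solve when models are integrated with agentic search tools (-3.15%-14.14%). The work highlights an important but underexplored challenge and motivates future research on LLMs that encode time-specific medical knowledge more faithfully.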

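🔬 Method Details

Problem definition: The paper addresses the lack of temporal awareness in large language models (LLMs) in the medical domain. Existing evaluations rely mainly on static datasets, which cannot reflect how medical knowledge changes over time. As a result, LLMs struggle to distinguish medical knowledge from different points in time, such as outdated treatments versus the latest clinical guidelines, and this limits their use in real clinical settings. The central shortcoming of existing methods is that they cannot accurately measure how well LLMs handle time-sensitive medical knowledge.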

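Core idea: The paper builds a time-sensitive benchmark for medical knowledge, TempoMed-Bench. The benchmark contains medical guideline knowledge that evolves over time, and it measures temporal awareness by testing how well LLMs know that knowledge at different points in time. The design mirrors the real-world setting in which medical knowledge is continually updated, so it gives a more complete assessment of an LLM's medical knowledge.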

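Technical framework: The overall TempoMed-Bench framework has four main components: 1) Data collection: gather medical guideline knowledge that evolves over time, such as treatment regimens and clinical recommendations from different years. 2) Data processing: convert the collected data into a format suitable for LLMs, such as question-answer pairs. 3) Model evaluation: use TempoMed-Bench to test how well LLMs know the medical knowledge of different points in time, and analyze their temporal awareness. 4) Result analysis: analyze the experimental results, identify where LLMs fall short in temporal awareness, and propose improvements.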
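The data-processing and evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `TemporalItem` schema, the `build_prompt` wording, the exact-match scoring, and the placeholder answers ("option A", "option B") are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TemporalItem:
    """One question whose correct answer depends on the reference year.

    Hypothetical schema; the real benchmark format may differ.
    """
    topic: str
    question: str
    answers_by_year: Dict[int, str]  # year -> recommendation in force that year


def build_prompt(item: TemporalItem, year: int) -> str:
    """Attach an explicit temporal context to the question (assumed wording)."""
    return f"As of {year}, {item.question}"


def evaluate(items: List[TemporalItem], model: Callable[[str], str]) -> Dict[int, float]:
    """Return exact-match accuracy for each reference year."""
    correct: Dict[int, int] = {}
    total: Dict[int, int] = {}
    for item in items:
        for year, gold in item.answers_by_year.items():
            pred = model(build_prompt(item, year))
            total[year] = total.get(year, 0) + 1
            hit = pred.strip().lower() == gold.strip().lower()
            correct[year] = correct.get(year, 0) + int(hit)
    return {year: correct[year] / total[year] for year in sorted(total)}


# Toy usage: a "model" that ignores the year and always returns the newest answer.
items = [
    TemporalItem(
        topic="example-guideline",
        question="which first-line option does the guideline recommend?",
        answers_by_year={2015: "option A", 2020: "option B"},
    )
]
print(evaluate(items, lambda prompt: "option B"))  # {2015: 0.0, 2020: 1.0}
```

The toy model shows the failure mode the benchmark targets: a model that only knows the current recommendation scores well on the latest year and fails on every earlier one.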

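Key innovation: The paper's main technical contribution is TempoMed-Bench itself, the first benchmark for evaluating the temporal awareness of LLMs in the medical domain. Compared with existing static benchmarks, TempoMed-Bench gives a more complete assessment of an LLM's medical knowledge and exposes its weaknesses in temporal awareness. The paper also analyzes in depth how LLMs behave with respect to temporal awareness and proposes improvements.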

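Key design: The key design choices of TempoMed-Bench are: 1) Time span: cover a long enough period to reflect how medical knowledge evolves. 2) Knowledge types: include several types of medical knowledge, such as diagnosis, treatment, and prevention. 3) Evaluation metrics: use accuracy, recall, and F1 score to evaluate LLM performance. 4) Baselines: select several representative LLMs as comparison baselines to assess the effectiveness of TempoMed-Bench.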
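For reference, the metrics named above can be computed as in the sketch below. It assumes each model response has already been reduced to a binary label (1 for the positive class, 0 otherwise). How TempoMed-Bench maps free-text answers to labels is not described in this summary, so that mapping and the function name `binary_metrics` are illustrative assumptions.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)

    # Guard every division so empty inputs or empty classes yield 0.0.
    accuracy = (tp + tn) / len(gold) if len(gold) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# One true positive, one false negative, one true negative, one false positive.
print(binary_metrics(gold=[1, 1, 0, 0], pred=[1, 0, 0, 1]))
```

Computing these scores separately for each reference year, rather than once over the whole dataset, is what would make a gradual decline or a year-to-year fluctuation visible.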

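📊 Experimental Highlights

The experiments show that LLM performance on TempoMed-Bench declines linearly over time, that accuracy on historical knowledge is only 25.37%-53.89% of accuracy on up-to-date knowledge, and that model predictions are inconsistent between neighboring years. Even with agentic search tools integrated, the temporal awareness problem remains hard to solve (-3.15%-14.14%). These results indicate that LLMs have a significant deficit in temporal awareness of medical knowledge.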
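Two of the quantities above are easy to misread, so a small sketch may help. The 25.37%-53.89% figure is a ratio of two accuracies, not an absolute accuracy, and "inconsistent between neighboring years" can be made concrete by counting how often correctness flips from one year to the next. Both helpers below are illustrative assumptions; the paper's exact definitions are not given in this summary.

```python
from typing import Dict


def relative_accuracy(historical_acc: float, up_to_date_acc: float) -> float:
    """Historical accuracy expressed as a percentage of up-to-date accuracy."""
    if up_to_date_acc <= 0:
        raise ValueError("up_to_date_acc must be positive")
    return 100.0 * historical_acc / up_to_date_acc


def count_flips(correct_by_year: Dict[int, bool]) -> int:
    """Count changes in correctness between consecutive years in the record.

    A temporally consistent model would flip at most around a guideline change;
    many flips suggest irregular fluctuation (an assumed proxy metric).
    """
    years = sorted(correct_by_year)
    return sum(
        1
        for prev, curr in zip(years, years[1:])
        if correct_by_year[prev] != correct_by_year[curr]
    )


# A model with 20% historical accuracy and 80% up-to-date accuracy
# retains a quarter of its up-to-date performance.
print(relative_accuracy(0.2, 0.8))

# Correct, wrong, correct, correct: two flips across neighboring years.
print(count_flips({2018: True, 2019: False, 2020: True, 2021: True}))
```

Under this reading, a relative accuracy of 25.37% means the model answers historical questions correctly about a quarter as often as it answers current ones, whatever the absolute accuracy happens to be.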

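🎯 Application Scenarios

The findings can help improve the accuracy and reliability of medical LLMs in settings such as clinical decision support and medical question answering. Better temporal awareness reduces the risk of misdiagnosis and mistreatment caused by outdated knowledge, and gives clinicians more accurate and timely information. In the longer term, the work can support the development of more capable and more reliable medical AI systems.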

📄 Abstract (Original)

The existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal examination-style benchmarks, while in reality, medical knowledge is inherently dynamic and continuously evolves as new evidence emerges and treatments are approved. Consequently, evaluating medical knowledge without a temporal context may provide an incomplete assessment of whether LLMs can accurately reason about time-specific medical knowledge. Moreover, most medical data are historical, requiring the models not only to recall the correct knowledge, but also to know when that knowledge is correct. To bridge the gap, we built TempoMed-Bench, the first-of-its-kind benchmark for evaluating the temporal awareness of the LLMs in the medical domain through evolving guideline knowledge. Based on the TempoMed-Bench, our evaluation analysis first reveals that LLMs lack temporal awareness in medical knowledge through the key findings: (1) model performance on up-to-date medical knowledge exhibits a gradual linear decline over time rather than a sharp knowledge-cutoff behavior, suggesting that parametric medical knowledge is not strictly bounded by knowledge cutoffs; (2) LLMs consistently struggle more with recalling outdated historical medical knowledge than with up-to-date recommendations: accuracy of historical knowledge is only 25.37%-53.89% of up-to-date knowledge, indicating potential knowledge forgetting effects during training; and (3) LLMs often exhibit temporally inconsistent behaviors, where predictions fluctuate irregularly across neighboring years. We also show that the temporal awareness problem is a challenge that cannot be easily solved when integrated with agentic search tools (-3.15%-14.14%). This work highlights an important yet underexplored challenge and motivates future research on developing LLMs that can better encode time-specific medical knowledge.
