Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

Authors: Lirong Gao, Zeqing Wang, Yuyan Cai, Jiayi Deng, Yanmei Gu, Yiming Zhang, Jia Zhou, Yanfei Zhang, Junbo Zhao

Category: cs.CL

Published: 2026-04-27

Note: Accepted at ACL 2026

🔗 Code/Project: GITHUB

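💡 One-Sentence Summary

Proposes ProHist-Bench, a benchmark for evaluating the reasoning ability of LLMs in historical research on the Chinese imperial examination (Keju) system.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, historical research, benchmark, imperial examination system, evidentiary reasoning

📋 Key Points

Existing evaluations of LLMs on historical tasks focus on knowledge breadth and lexical understanding, overlooking higher-order historical research skills such as evidentiary reasoning.
ProHist-Bench is built on the Chinese imperial examination system to test LLM reasoning in complex historical contexts.
Experiments show that even state-of-the-art LLMs perform poorly on ProHist-Bench, revealing a significant gap in professional-level proficiency.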

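🔬 Method Details

Problem definition: LLMs show some ability on history-related tasks, but existing benchmarks concentrate on knowledge breadth and lexical understanding and do not assess higher-order historical reasoning such as evidentiary reasoning. A more challenging benchmark is therefore needed to measure what LLMs can actually do in historical research.

Core idea: Build a high-quality, high-difficulty historical research benchmark that can effectively evaluate LLM reasoning in the historical domain. The Chinese imperial examination system is chosen as the vehicle because it touches many facets of East Asian political, social, and intellectual history and spans more than 1,300 years, which makes it both complex and challenging.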

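Technical framework: ProHist-Bench has four main components:
1. Question design: 400 challenging, expert-curated questions covering many aspects of the imperial examination system.
2. Dynasty coverage: the questions span eight dynasties, so the benchmark is broad and representative.
3. Evaluation rubrics: each question comes with fine-grained rubrics (10,891 in total), so that LLM answers can be assessed objectively and accurately.
4. LLM evaluation: 18 LLMs are evaluated on ProHist-Bench to gauge their historical research ability.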
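The components above suggest a simple record layout. The following is a minimal sketch, not the released data format: the class names `Rubric` and `BenchmarkItem` and their fields are hypothetical, chosen only for illustration. The one grounded computation is the average rubric count implied by the reported totals (10,891 rubrics over 400 questions).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rubric:
    """One fine-grained evaluation criterion (hypothetical schema)."""
    description: str


@dataclass
class BenchmarkItem:
    """One expert-curated question with its rubrics (hypothetical schema)."""
    question: str
    dynasty: str
    rubrics: List[Rubric] = field(default_factory=list)


def average_rubrics_per_question(total_rubrics: int, total_questions: int) -> float:
    """Mean number of rubrics attached to each question."""
    if total_questions <= 0:
        raise ValueError("total_questions must be positive")
    return total_rubrics / total_questions


# Totals reported in the abstract: 10,891 rubrics across 400 questions.
avg = average_rubrics_per_question(10_891, 400)
print(f"{avg:.1f} rubrics per question on average")
```

On these totals, each question carries roughly 27 rubrics on average, which gives a sense of how fine-grained the assessment is.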

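Key innovation: ProHist-Bench focuses on higher-order historical reasoning rather than knowledge breadth alone. Carefully designed questions and fine-grained rubrics allow a more accurate measurement of how LLMs perform in historical research. Using the imperial examination system as the vehicle also makes the benchmark distinctive and challenging.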

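Key design: ProHist-Bench rests on three design choices:
1. Question difficulty: questions target evidentiary reasoning, logical reasoning, and critical thinking, and avoid simple factual recall.
2. Fine-grained evaluation: rubrics are broken down to individual aspects of each question, so answers can be assessed more precisely.
3. Expert involvement: both the questions and the rubrics are reviewed by historians to ensure accuracy and reliability.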
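To make the idea of rubric-level assessment concrete, here is a minimal scoring sketch. It assumes that each rubric is judged as satisfied or not and that all rubrics carry equal weight; this summary does not state the benchmark's actual aggregation rule, so the functions `rubric_score` and `benchmark_score` are illustrative assumptions, and the judgments are toy data rather than results.

```python
from typing import List


def rubric_score(judgments: List[bool]) -> float:
    """Fraction of rubrics judged satisfied (assumes equal weights)."""
    if not judgments:
        raise ValueError("at least one rubric judgment is required")
    return sum(judgments) / len(judgments)


def benchmark_score(per_question: List[List[bool]]) -> float:
    """Macro-average: mean of the per-question rubric scores."""
    if not per_question:
        raise ValueError("at least one question is required")
    scores = [rubric_score(j) for j in per_question]
    return sum(scores) / len(scores)


# Toy judgments for two questions with four rubrics each (not real data).
toy = [
    [True, False, True, True],    # 3 of 4 rubrics satisfied
    [False, False, True, False],  # 1 of 4 rubrics satisfied
]
print(rubric_score(toy[0]))  # 0.75
print(benchmark_score(toy))  # 0.5
```

Macro-averaging gives every question the same influence regardless of how many rubrics it has; pooling all rubrics into one pass rate would instead weight rubric-heavy questions more.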

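📊 Experimental Highlights

An evaluation of 18 LLMs on ProHist-Bench shows that even state-of-the-art models perform poorly on complex historical research questions, revealing a significant gap in professional-level historical reasoning. The result underscores the need for domain-specific reasoning LLMs.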

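🎯 Application Scenarios

ProHist-Bench can be used to train and evaluate domain-specific LLMs and to improve their usefulness in historical research, cultural heritage preservation, and education. The way the benchmark is constructed could also be extended to other specialized fields, supporting LLM use in a wider range of knowledge-intensive tasks.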

📄 Abstract (Original)

While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
