SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Authors: Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Fabian Ternava, Jianfeng Gao, Tobias Röddiger, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues

Category: cs.CL

Published: 2024-06-14 (updated: 2024-10-02)

Note: Accepted to EMNLP 2024 Main Conference

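💡 One-Sentence Takeaway

SciEx: a benchmark for evaluating LLMs on university computer science exam questions, with both human expert grading and automatic grading.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, evaluation benchmark, computer science, exam questions, multimodal, free-form questions, human grading, automatic grading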

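📋 Key Points

Existing benchmarks do not adequately evaluate LLMs on scientific tasks, particularly in the setting of computer science exams, which makes it hard to measure their abilities accurately.
SciEx contains multilingual, multimodal, free-form questions drawn from university exams, which brings the evaluation closer to real-world use and allows a more comprehensive assessment of LLMs.
Experiments show that the best current LLM reaches an average exam grade of only 59.4% on SciEx. They also show that LLMs are viable graders, which makes it possible to evaluate future models automatically.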

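📝 Abstract (Translated)

With the rapid development of large language models (LLMs), benchmarks that can evaluate their abilities across different domains are essential. One common use of LLMs is to perform tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are assessed on such tasks, this paper proposes SciEx, a benchmark of university computer science exam questions for evaluating how well LLMs solve scientific tasks. SciEx is (1) multilingual, containing English and German exams; (2) multimodal, containing questions that involve images; and (3) made up of free-form questions of various types and difficulty levels, owing to the nature of university exams. The authors evaluate a range of state-of-the-art LLMs on the new benchmark. Because SciEx questions are free-form, evaluating LLM performance is not straightforward, so the authors provide human expert grading of the LLM outputs. They show that the free-form exams in SciEx remain challenging for current LLMs: the best LLM achieves an exam grade of only 59.4% on average. They also provide a detailed comparison between LLM performance and student performance on SciEx. To enable evaluation of new LLMs in the future, they propose using LLM-as-a-judge to grade LLM answers on SciEx. Their experiments show that, although LLMs do not solve the exams perfectly, they are decent graders, reaching a Pearson correlation of 0.948 with expert grading.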

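🔬 Method Details

Problem definition: The paper addresses the lack of an effective benchmark for evaluating large language models (LLMs) in computer science. Existing approaches struggle to measure how well LLMs solve scientific problems, especially university exam questions. They typically rely on multiple-choice or fill-in-the-blank formats, which cannot fully test free-form answering and problem-solving ability.

Core idea: The paper builds SciEx, an evaluation benchmark that is closer to real-world use. It is based on university computer science exams and contains multilingual (English and German), multimodal (image-based), free-form questions at various difficulty levels. Combining human expert grading with automatic LLM grading gives a more comprehensive picture of LLM performance.

Technical framework: Building and using the SciEx benchmark involves the following stages:
1. Dataset construction: collect university computer science exam questions covering different difficulty levels and topics.
2. Data annotation: human experts annotate the questions and provide reference answers.
3. LLM evaluation: run different LLMs on SciEx and record their answers.
4. Grading: human experts grade the LLM answers, and LLM-as-a-judge grades them automatically.
5. Performance analysis: compare the LLMs with one another and with student performance on SciEx.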
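
The grading and analysis stages can be sketched in code. The following is a minimal sketch, not the authors' implementation: it assumes that an exam grade is the points earned as a percentage of the points available, and that the reported average gives every exam equal weight. The `GradedAnswer` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GradedAnswer:
    """One exam question with the points a grader awarded to an answer."""
    exam_id: str       # hypothetical exam identifier
    max_points: float  # maximum points available for the question
    points: float      # points awarded by the grader


def exam_grade_percent(answers):
    """Points earned as a percentage of the points available in one exam."""
    total = sum(a.max_points for a in answers)
    if total <= 0:
        raise ValueError("exam has no gradable points")
    return 100.0 * sum(a.points for a in answers) / total


def average_exam_grade(answers):
    """Mean of per-exam percentages, so every exam has equal weight."""
    by_exam = {}
    for a in answers:
        by_exam.setdefault(a.exam_id, []).append(a)
    return mean(exam_grade_percent(group) for group in by_exam.values())
```

Averaging per-exam percentages rather than pooling all points is an assumption of this sketch; it keeps a long exam from dominating the average.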

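Key innovations:
1. The SciEx benchmark: a computer science exam benchmark that is closer to real-world use, with multilingual, multimodal, free-form questions.
2. LLM-as-a-judge: the paper explores using LLMs as graders and shows that their grades correlate strongly with human expert grades.

Key design:
1. Multilingual support: SciEx contains exam questions in English and German, to evaluate the multilingual ability of LLMs.
2. Multimodal support: SciEx contains questions that involve images, to evaluate multimodal understanding.
3. Free-form questions: SciEx contains various types of free-form questions, to evaluate free-form answering and problem-solving ability.
4. Grading mechanism: human expert grading is combined with automatic LLM grading, to improve both the accuracy and the efficiency of grading.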
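
The automatic grading step can be illustrated with a small prompt-and-parse helper. This is a hedged sketch only: the prompt wording, the `Points: <number>` reply format, and both function names are assumptions made for illustration, and the summary does not describe the prompts the authors used. The call to an actual LLM is deliberately left out.

```python
import re


def build_judge_prompt(question, answer, max_points, reference=None):
    """Assemble a grading prompt; the wording is illustrative only."""
    parts = [
        "You are grading a university computer science exam.",
        f"Question (max {max_points} points):\n{question}",
    ]
    if reference is not None:
        parts.append(f"Reference answer:\n{reference}")
    parts.append(f"Answer to grade:\n{answer}")
    parts.append("Reply with a line of the form 'Points: <number>'.")
    return "\n\n".join(parts)


def parse_points(reply, max_points):
    """Extract the awarded points and cap them at max_points."""
    match = re.search(r"Points:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return None  # caller decides how to handle an unparseable reply
    return min(float(match.group(1)), float(max_points))
```

Returning `None` for an unparseable reply, instead of guessing a score, leaves the retry or fallback policy to the caller.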

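📊 Experimental Highlights

The best current LLM reaches an average exam grade of only 59.4% on SciEx, which shows that free-form computer science exams remain challenging for LLMs. The experiments also confirm that LLM-as-a-judge is viable: its grades reach a Pearson correlation of 0.948 with human expert grades, which opens a path to evaluating LLM performance automatically.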
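
For reference, the Pearson correlation between two sets of grades can be computed as in the sketch below. It is a plain textbook implementation, not the authors' evaluation code, and the function name and argument names are chosen here for illustration.

```python
from math import sqrt


def pearson(expert_grades, judge_grades):
    """Pearson correlation coefficient between two equal-length grade lists."""
    n = len(expert_grades)
    if n != len(judge_grades) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(expert_grades) / n
    mean_y = sum(judge_grades) / n
    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(expert_grades, judge_grades))
    var_x = sum((x - mean_x) ** 2 for x in expert_grades)
    var_y = sum((y - mean_y) ** 2 for y in judge_grades)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant grades")
    return cov / sqrt(var_x * var_y)
```

The coefficient measures linear agreement only: a judge that consistently awards twice the expert's points would still score close to 1.0, so a high correlation does not by itself show that the absolute grades match.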

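🎯 Application Scenarios

The SciEx benchmark can be used to evaluate and compare LLMs in computer science, helping researchers understand their strengths and weaknesses. LLM-as-a-judge can also reduce the cost of human grading and make evaluation more efficient. The results support the use of LLMs in education and research, for example in intelligent tutoring systems and automatic code generation.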

📄 Abstract (Original)

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper, we propose SciEx - a benchmark consisting of university computer science exam questions, to evaluate LLMs ability on solving scientific tasks. SciEx is (1) multilingual, containing both English and German exams, and (2) multi-modal, containing questions that involve images, and (3) contains various types of freeform questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are freeform, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for the current LLMs, where the best LLM only achieves 59.4% exam grade on average. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving 0.948 Pearson correlation with expert grading.