Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Authors: Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne
Categories: cs.CL
Published: 2026-04-30
Venue: 2026 ACL Workshop BEA (21st Workshop on Innovative Use of NLP for Building Educational Applications)
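💡 One-Sentence Takeaway
An evaluation method based on item response theory (IRT) for estimating how well LLMs grade short answers
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: automated grading, item response theory, large language models, educational assessment, machine learning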
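📋 Key Points
- Automated short answer grading is evaluated mainly with aggregate metrics, which cannot show how grading performance varies across responses of differing difficulty.
- The paper proposes an evaluation framework based on item response theory that relates grading correctness to grader ability and response difficulty.
- Experiments show that models differ substantially in how sharply their accuracy declines as grading difficulty increases, and that errors on difficult responses concentrate on the partially_correct_incomplete label.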
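📝 Abstract (translated summary)
Automated short answer grading (ASAG) is usually evaluated with aggregate metrics such as macro-F1 and Cohen's kappa, but these metrics offer limited insight into how performance varies across student responses of differing grading difficulty. The paper introduces an evaluation framework for LLM-based ASAG built on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. The framework enables response-level analysis of where LLM graders succeed or fail, and it reveals robustness differences that aggregate scores do not show. The authors apply it to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases.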
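To make the limitation of aggregate metrics concrete, the following is a minimal sketch of macro-F1 and Cohen's kappa in plain Python. The toy labels and the two hypothetical graders (`grader_a`, `grader_b`) are assumptions for illustration and do not come from the paper. The two graders err on different responses, yet both metrics come out identical, so neither metric says which response was hard.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

def cohen_kappa(gold, pred):
    """Agreement between gold and predicted labels, corrected for chance."""
    n = len(gold)
    p_observed = sum(g == p for g, p in zip(gold, pred)) / n
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_expected = sum(gold_counts[l] * pred_counts[l] for l in gold_counts) / (n * n)
    if p_expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is a convention chosen for this sketch
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical toy data: four responses, two labels.
gold = ["correct", "correct", "incorrect", "incorrect"]
grader_a = ["incorrect", "correct", "incorrect", "incorrect"]  # misses response 0
grader_b = ["correct", "incorrect", "incorrect", "incorrect"]  # misses response 1

scores_a = (macro_f1(gold, grader_a), cohen_kappa(gold, grader_a))
scores_b = (macro_f1(gold, grader_b), cohen_kappa(gold, grader_b))
```

Here `scores_a` and `scores_b` are equal even though the graders fail on different responses, which is the response-level information an IRT-style analysis is meant to recover.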
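🔬 Method Details
Problem definition: The paper addresses a gap in how automated short answer grading is evaluated. Aggregate metrics hide how graders behave on responses of differing difficulty.
Core idea: Using item response theory (IRT), grading correctness is treated as a function of latent grader ability and response difficulty, which allows grading performance to be analyzed at the level of individual responses.
Technical framework: The framework estimates latent grader ability, estimates response difficulty, and then analyzes grading correctness in terms of these two quantities, giving a complete evaluation pipeline.
Key innovation: The main contribution is applying IRT to LLM-based automatic grading, which enables fine-grained analysis beyond conventional aggregate performance evaluation.
Key design: The framework is applied to 17 open-weight LLMs on two benchmarks, SciEntsBank and Beetle. Estimated difficulty is then related to semantic and linguistic properties of the responses: semantic alignment with the reference answer, contradiction signals, and semantic isolation in embedding space.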
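The core idea can be illustrated with the simplest IRT model. The summary does not say which IRT variant or estimator the authors use, so the sketch below assumes a one-parameter (Rasch) model, P(correct) = sigmoid(theta_j - b_i), fitted by gradient ascent on a lightly L2-penalized joint likelihood. The function name `fit_rasch`, the hyperparameters, and the toy matrix are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_rasch(correct, steps=2000, lr=0.05, l2=0.01):
    """Estimate grader abilities (theta) and response difficulties (b).

    correct[j][i] is 1 if grader j assigned the gold label to response i, else 0.
    Assumed model: P(correct[j][i] = 1) = sigmoid(theta[j] - b[i]).
    The small L2 penalty keeps estimates finite when a row or column is all 0s or all 1s.
    """
    n_graders, n_items = len(correct), len(correct[0])
    theta = [0.0] * n_graders
    b = [0.0] * n_items
    for _ in range(steps):
        grad_theta = [-l2 * t for t in theta]
        grad_b = [-l2 * d for d in b]
        for j in range(n_graders):
            for i in range(n_items):
                residual = correct[j][i] - sigmoid(theta[j] - b[i])
                grad_theta[j] += residual  # d log-likelihood / d theta_j
                grad_b[i] -= residual      # d log-likelihood / d b_i
        theta = [t + lr * g for t, g in zip(theta, grad_theta)]
        b = [d + lr * g for d, g in zip(b, grad_b)]
    return theta, b

# Hypothetical toy matrix: 3 graders (rows) x 4 responses (columns).
correct = [
    [1, 1, 1, 0],  # strongest grader
    [1, 1, 0, 0],
    [1, 0, 0, 0],  # weakest grader
]
theta, b = fit_rasch(correct)
```

In this toy matrix, response 0 is graded correctly by every grader and response 3 by none, so the fitted `b` increases from response 0 to response 3 while `theta` decreases from the first grader to the last. A two-parameter model would add a per-response discrimination term; whether the paper uses one is not stated in the summary.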
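📊 Experimental Highlights
Even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. Errors on difficult responses concentrate disproportionately on the partially_correct_incomplete label, which indicates a tendency to collapse toward an intermediate label when a response is ambiguous.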
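One way to look at this kind of decline is to bin responses by estimated difficulty and compute each grader's accuracy per bin. The sketch below assumes difficulty estimates are already available; the helper `accuracy_by_difficulty`, the equal-size binning, and the toy data are hypothetical and not taken from the paper.

```python
def accuracy_by_difficulty(correct_row, difficulty, n_bins=3):
    """Accuracy of one grader within equal-size bins of increasing difficulty.

    correct_row[i] is 1 if the grader labelled response i correctly, else 0.
    difficulty[i] is the estimated difficulty of response i.
    Assumes n_bins <= number of responses so that no bin is empty.
    """
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    accuracies = []
    for k in range(n_bins):
        lo = k * len(order) // n_bins
        hi = (k + 1) * len(order) // n_bins
        idx = order[lo:hi]
        accuracies.append(sum(correct_row[i] for i in idx) / len(idx))
    return accuracies

# Hypothetical data: nine responses, listed from easiest to hardest.
difficulty = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
grader_a = [1, 1, 1, 1, 1, 0, 1, 0, 0]  # errors pile up on hard responses
grader_b = [1, 1, 0, 1, 0, 1, 0, 1, 1]  # errors spread evenly

curve_a = accuracy_by_difficulty(grader_a, difficulty)
curve_b = accuracy_by_difficulty(grader_b, difficulty)
```

Both toy graders get six of nine responses right, but `curve_a` falls from the easy bin to the hard bin while `curve_b` stays flat. That is the pattern the highlights describe: equal aggregate accuracy, different robustness to difficulty.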
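🎯 Application Scenarios
Potential application areas include educational assessment, online learning platforms, and intelligent tutoring systems. With a clearer picture of which student responses an LLM grader handles reliably, educators can better understand how students are doing and provide individualized feedback and support.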
📄 Abstract (Original)
Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the partially_correct_incomplete label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.
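The abstract does not say how semantic alignment or semantic isolation are measured. As one plausible operationalization, assumed here and not the authors' stated method, alignment can be taken as the cosine similarity between a response embedding and the reference-answer embedding, and isolation as the mean cosine distance from a response to its k nearest neighbouring responses. The function names and the 2-D toy vectors below are hypothetical stand-ins for real sentence embeddings. Contradiction signals, which would typically need a separate inference model, are not covered.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment(response_vec, reference_vec):
    """Assumed alignment measure: similarity of a response to the reference answer."""
    return cosine(response_vec, reference_vec)

def isolation(idx, vectors, k=2):
    """Assumed isolation measure: mean cosine distance to the k nearest other responses."""
    distances = sorted(
        1.0 - cosine(vectors[idx], v) for j, v in enumerate(vectors) if j != idx
    )
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

# Hypothetical 2-D "embeddings": three similar responses and one outlier.
reference = [1.0, 0.0]
responses = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]

alignments = [alignment(r, reference) for r in responses]
isolations = [isolation(i, responses) for i in range(len(responses))]
```

In this toy set the last response has the lowest alignment and the highest isolation. Under the associations the abstract reports, such a response would be expected to receive a higher estimated difficulty.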