Grading Handwritten Engineering Exams with Multimodal Large Language Models

Authors: Janez Perš, Jon Muhovič, Andrej Košir, Boštjan Murovec

Categories: cs.CV

Published: 2026-01-02

Comments: 10 pages, 5 figures, 2 tables. Supplementary material available at https://lmi.fe.uni-lj.si/en/janez-pers-2/supplementary-material/

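💡 One-sentence summary

Proposes an end-to-end workflow that uses multimodal large language models to grade scanned handwritten engineering exams.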

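🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: handwritten exam grading, multimodal large language models, automated grading, educational assessment, machine learning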

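📋 Key points

Existing approaches to grading handwritten exams are inefficient and do not scale to large exams; manual grading is tedious and subjective.
The paper proposes a grading system built on multimodal large language models that automates grading from a handwritten reference solution and a short set of grading rules.
In experiments, the system's grades differ from the lecturer's by a mean absolute difference of about 8 points with low bias, and ablations indicate that the structured design reduces systematic over-grading.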

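📝 Abstract (condensed)

Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and hard to scale. The paper presents an end-to-end workflow that uses multimodal large language models to grade scanned handwritten engineering quizzes while preserving the standard exam process. The lecturer provides only a handwritten reference solution and a short set of grading rules; the reference solution is converted into a text summary that guides grading. Reliability comes from a multi-stage design: a format check, an ensemble of independent graders, supervisor aggregation, and rigid templates, which together produce auditable, machine-parseable reports. On a real course quiz written in Slovenian, the pipeline's grades differ from the lecturer's by a mean absolute difference of about 8 points with low bias, and the manual-review trigger rate is about 17%.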

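🔬 Method details

Problem definition: The paper targets the low efficiency and subjectivity of grading handwritten engineering exams. Existing methods handle open-ended questions and diagrams poorly and do not scale to large grading workloads.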

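Core idea: The handwritten reference solution is converted into a text summary, and a multimodal large language model grades each answer conditioned on that summary, giving an automated and standardized grading process.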

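Technical framework: The pipeline has several stages. A format and presence check first prevents blank answers from being graded; an ensemble of independent graders followed by supervisor aggregation then improves reliability; finally, the pipeline produces an auditable, machine-parseable report.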
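The stage order described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `grade_answer`, `GradeReport`, and `Grader` are hypothetical, answers are plain strings rather than scanned pages, the graders are arbitrary callables standing in for model calls, the median stands in for the paper's supervisor aggregation, and `d_max` is assumed to be a threshold on the spread between grader scores (the abstract does not define $D_{\max}$).

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, List

# Hypothetical grader signature: (reference_summary, answer) -> score in [0, 100].
Grader = Callable[[str, str], float]


@dataclass
class GradeReport:
    score: float
    grader_scores: List[float]
    needs_review: bool
    note: str


def grade_answer(answer: str, reference_summary: str,
                 graders: List[Grader], d_max: float = 40.0) -> GradeReport:
    # Stage 1: presence check, so a blank answer never reaches the graders.
    if not answer.strip():
        return GradeReport(0.0, [], False, "blank answer")
    # Stage 2: each grader scores the same answer independently.
    scores = [g(reference_summary, answer) for g in graders]
    # Stage 3: aggregation. The median is a stand-in (assumption) for the
    # supervisor stage described in the paper.
    final = median(scores)
    # Assumed review rule: flag the answer when graders disagree by more than d_max.
    spread = max(scores) - min(scores)
    return GradeReport(final, scores, spread > d_max, "ok")
```

With three stub graders returning 60, 70, and 80, the sketch reports a score of 70 and no review flag, because the spread of 20 is below the assumed threshold of 40.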

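Key innovation: The main technical contribution is the combination of structured prompting and reference grounding, which substantially improves grading accuracy and avoids the systematic over-grading caused by trivial prompting.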

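Key design: The pipeline uses multi-stage grading with rigid templates and deterministic validation, which keeps the grading process transparent and auditable.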
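As a rough illustration of deterministic validation against a rigid template, the sketch below accepts a grader report only if it matches a fixed three-line layout exactly. The template itself (`TASK`, `SCORE`, `JUSTIFICATION`), the 0 to 100 score range, and the function name `parse_report` are assumptions; the paper's actual template is not given in the abstract.

```python
import re
from typing import Dict, Optional

# Hypothetical template: exactly three labeled lines, in this order.
REPORT_PATTERN = re.compile(
    r"\ATASK: (?P<task>\d+)\n"
    r"SCORE: (?P<score>\d{1,3})\n"
    r"JUSTIFICATION: (?P<justification>[^\n]+)\n?\Z"
)


def parse_report(text: str) -> Optional[Dict[str, object]]:
    """Return the parsed fields, or None if the report breaks the template."""
    m = REPORT_PATTERN.match(text)
    if m is None:
        return None  # wrong labels, wrong order, or extra lines
    score = int(m.group("score"))
    if not 0 <= score <= 100:
        return None  # well-formed but out of the assumed score range
    return {
        "task": int(m.group("task")),
        "score": score,
        "justification": m.group("justification").strip(),
    }
```

Because the check is a plain pattern match plus a range test, the same report always yields the same verdict, and a rejected report can be regenerated or routed to a human instead of being parsed loosely.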

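📊 Experimental highlights

Compared with lecturer grades, the full pipeline reaches a mean absolute difference of about 8 points with low bias, and an estimated manual-review trigger rate of about 17% at $D_{\max}=40$. These results suggest the method can reduce the manual grading workload while keeping grades close to the lecturer's.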
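The two agreement measures quoted above can be computed as in the following sketch. It assumes that "bias" means the mean signed difference between pipeline and lecturer grades, which is a common reading but is not spelled out in the abstract; the function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def _check(model: Sequence[float], lecturer: Sequence[float]) -> None:
    if len(model) != len(lecturer) or not model:
        raise ValueError("grade lists must be non-empty and of equal length")


def mean_absolute_difference(model: Sequence[float], lecturer: Sequence[float]) -> float:
    # Average size of the disagreement, ignoring its direction.
    _check(model, lecturer)
    return mean(abs(m - l) for m, l in zip(model, lecturer))


def bias(model: Sequence[float], lecturer: Sequence[float]) -> float:
    # Mean signed difference (assumed definition): positive values mean over-grading.
    _check(model, lecturer)
    return mean(m - l for m, l in zip(model, lecturer))
```

A low bias together with a nonzero mean absolute difference means the pipeline's errors go in both directions rather than consistently favoring or penalizing students.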

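🎯 Application scenarios

Potential applications include educational assessment, online learning platforms, and automated grading systems. By improving grading efficiency and accuracy, the method could help institutions manage large exams, reduce the grading load on teachers, and give students faster feedback.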

📄 Abstract (original)

Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. We present an end-to-end workflow for grading scanned handwritten engineering quizzes with multimodal large language models (LLMs) that preserves the standard exam process (A4 paper, unconstrained student handwriting). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the reference is converted into a text-only summary that conditions grading without exposing the reference scan. Reliability is achieved through a multi-stage design with a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable, machine-parseable reports. We evaluate the frozen pipeline in a clean-room protocol on a held-out real course quiz in Slovenian, including hand-drawn circuit schematics. With state-of-the-art backends (GPT-5.2 and Gemini-3 Pro), the full pipeline achieves $\approx$8-point mean absolute difference to lecturer grades with low bias and an estimated manual-review trigger rate of $\approx$17% at $D_{\max}=40$. Ablations show that trivial prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading, confirming that structured prompting and reference grounding are essential.
