Investigating Automatic Scoring and Feedback using Large Language Models

Authors: Gloria Ashiya Katuka, Alexander Gain, Yen-Yun Yu

Categories: cs.CL, cs.LG

Published: 2024-05-01

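💡 One-Sentence Summary

Fine-tunes quantized LLaMA-2 models with PEFT to score answers and generate feedback automatically at low cost and low latency.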

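🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, parameter-efficient fine-tuning, model quantization, automatic scoring, feedback generation, educational applications, LLaMA-2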

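📋 Key Points

Traditional automatic scoring and feedback methods are computationally expensive and struggle to reach expert-level quality, which limits their use.
Fine-tuning quantized LLMs with PEFT methods sharply reduces the compute and memory required while maintaining, or even improving, performance.
Experiments show strong results in both scoring accuracy and feedback quality, indicating practical value.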

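📝 Abstract (summary)

This paper studies automatic scoring and feedback generation with large language models (LLMs). Although LLMs perform well, fine-tuning them requires substantial computational resources. To address this, the paper explores quantized models fine-tuned with parameter-efficient fine-tuning (PEFT) methods such as LoRA and QLoRA, which lower the memory and compute needed for fine-tuning. Specifically, the LLMs are fine-tuned with a classification or regression head to assign continuous numerical grades to short answers and essays, and to generate the corresponding feedback. Experiments were run on both proprietary and open-source datasets. The fine-tuned LLMs predict grades with high accuracy, with less than 3% error in grade percentage on average. For graded feedback, fine-tuned 4-bit quantized LLaMA-2 13B models outperform competitive base models and closely match subject matter expert feedback, both in BLEU and ROUGE scores and in qualitative assessment. The study offers insight into using quantization approaches to fine-tune LLMs for downstream tasks such as automatic short answer scoring and feedback generation at comparatively low cost and latency.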

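🔬 Method Details

Problem definition: The paper addresses automatic scoring and feedback generation. Existing approaches based on traditional machine learning and deep learning handle complex text poorly and need large amounts of labeled data and compute. LLMs are far more capable, but the cost of fine-tuning them directly is too high, which has held back their adoption in education.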

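Core idea: Combine parameter-efficient fine-tuning (PEFT), specifically LoRA and QLoRA, with model quantization to cut the cost of fine-tuning LLMs, so that automatic scoring and feedback generation become feasible on limited hardware. Quantization also makes the model smaller and inference faster.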
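To make the cost argument concrete, the arithmetic below is a rough sketch of why LoRA and 4-bit quantization shrink the fine-tuning footprint. The function names, the rank of 8, the 5120 x 5120 layer shape, and the nominal 13 billion parameter count are illustrative assumptions, not values reported in the paper. The memory estimate covers weights only and ignores quantization constants, activations, and optimizer state.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA update B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    d = 5120  # assumed layer width, for illustration only
    full = d * d  # parameters in one dense d x d weight matrix
    lora = lora_trainable_params(d, d, rank=8)
    print(f"full layer: {full} params, LoRA update: {lora} params ({lora / full:.4%})")
    for bits in (16, 4):
        print(f"{bits}-bit weights, 13e9 params: {weight_memory_gb(13e9, bits):.1f} GB")
```

Under these assumptions the trainable update is a small fraction of one dense layer, and moving from 16-bit to 4-bit weights cuts weight storage to a quarter.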

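Technical framework: The pipeline has five main stages: 1) select a pretrained LLM (such as LLaMA-2); 2) quantize the LLM, for example to 4 bits; 3) fine-tune the LLM with a PEFT method (LoRA or QLoRA) on scoring and feedback-generation datasets, adding a classification or regression head for grade prediction; 4) use the fine-tuned model to score answers and generate feedback; 5) evaluate the model on scoring accuracy and feedback quality.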
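The quantization stage (step 2) can be illustrated with a toy quantizer. The sketch below uses simple symmetric absmax rounding to signed 4-bit integer codes with one shared scale per block. This is a stand-in chosen for readability, not necessarily the quantization scheme used in the paper, and the function names are hypothetical.

```python
def quantize_absmax_4bit(values):
    """Map a non-empty block of floats to integer codes in [-7, 7]
    using one shared scale (symmetric absmax quantization)."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak > 0 else 1.0  # avoid dividing by zero for an all-zero block
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    block = [0.7, -0.3, 0.12, 0.0]
    codes, scale = quantize_absmax_4bit(block)
    restored = dequantize(codes, scale)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("codes:", codes)
    print("worst-case reconstruction error:", worst)
```

Each weight is stored as a small integer code plus a per-block scale, which is where the memory saving comes from. The price is a rounding error of at most half the scale per weight, which is the size, speed, and accuracy trade-off that the method description refers to.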

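Key innovation: The paper combines PEFT with model quantization and applies the combination to automatic scoring and feedback generation. This preserves model performance while substantially lowering fine-tuning cost and inference latency. The paper also examines how different PEFT methods and quantization strategies affect performance.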

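Key design: A classification or regression head predicts the grade during fine-tuning. The loss function follows the task type: for example, mean squared error (MSE) for regression and cross-entropy for classification. The LoRA rank and the learning rate are the key hyperparameters and need to be tuned per dataset. The quantization level (such as 4-bit) is a trade-off among model size, inference speed, and accuracy.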
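As a minimal sketch of the two loss options named above, the functions below compute MSE for a regression head and softmax cross-entropy for a classification head in plain Python. The function names, the normalized score range of 0 to 1, and the example values are assumptions for illustration; the paper's actual training code is not shown in this summary.

```python
import math


def mse_loss(predicted, target):
    """Mean squared error between predicted and reference grades."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cross_entropy_loss(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max before exponentiating for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


if __name__ == "__main__":
    # Regression head: grades normalized to [0, 1] (an assumed convention).
    print("MSE:", mse_loss([0.8, 0.6], [1.0, 0.5]))
    # Classification head: one logit per grade bucket; bucket 2 is the true grade.
    print("cross-entropy:", cross_entropy_loss([0.2, 1.5, 3.0], target_index=2))
```

A regression head suits continuous grades directly, while a classification head requires bucketing grades into discrete classes first.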

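📊 Experimental Highlights

The PEFT-fine-tuned, 4-bit quantized LLaMA-2 13B model performs well on automatic scoring, with an average grade error below 3%. For feedback generation, it outperforms the baseline models on both BLEU and ROUGE, and its feedback is qualitatively very close to expert feedback. These results suggest that high-quality automatic scoring and feedback generation are achievable at low cost.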
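The two kinds of metric mentioned above can be sketched as follows. This summary does not give the paper's exact error definition or ROUGE variant, so the code assumes mean absolute error as a percentage of the maximum score and a unigram ROUGE-1 F1 with whitespace tokenization; the function names and sample data are hypothetical. A real evaluation would normally rely on an established metrics package rather than this simplified version.

```python
from collections import Counter


def mean_grade_error_pct(predicted, actual, max_score):
    """Mean absolute grade error as a percentage of the maximum score."""
    errors = [abs(p - a) / max_score * 100 for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between generated and reference feedback."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(mean_grade_error_pct([8, 4.5], [9, 5], max_score=10))
    print(rouge1_f1("the answer is wrong", "the answer is correct"))
```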

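🎯 Application Scenarios

The results apply to online education platforms, automated grading systems, and personalized learning tools. Automatic scoring and feedback can reduce teachers' workload, improve teaching efficiency, and give students timely, useful support. The same approach could plausibly extend to other natural language processing tasks, such as text summarization and machine translation.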

📄 Abstract (original)

Automatic grading and feedback have been long studied using traditional machine learning and deep learning techniques using language models. With the recent accessibility to high performing large language models (LLMs) like LLaMA-2, there is an opportunity to investigate the use of these LLMs for automatic grading and feedback generation. Despite the increase in performance, LLMs require significant computational resources for fine-tuning and additional specific adjustments to enhance their performance for such tasks. To address these issues, Parameter Efficient Fine-tuning (PEFT) methods, such as LoRA and QLoRA, have been adopted to decrease memory and computational requirements in model fine-tuning. This paper explores the efficacy of PEFT-based quantized models, employing classification or regression head, to fine-tune LLMs for automatically assigning continuous numerical grades to short answers and essays, as well as generating corresponding feedback. We conducted experiments on both proprietary and open-source datasets for our tasks. The results show that prediction of grade scores via finetuned LLMs are highly accurate, achieving less than 3% error in grade percentage on average. For providing graded feedback fine-tuned 4-bit quantized LLaMA-2 13B models outperform competitive base models and achieve high similarity with subject matter expert feedback in terms of high BLEU and ROUGE scores and qualitatively in terms of feedback. The findings from this study provide important insights into the impacts of the emerging capabilities of using quantization approaches to fine-tune LLMs for various downstream tasks, such as automatic short answer scoring and feedback generation at comparatively lower costs and latency.
