Automating Autograding: Large Language Models as Test Suite Generators for Introductory Programming

Authors: Umar Alkafaween, Ibrahim Albluwi, Paul Denny

Categories: cs.CY, cs.AI

Published: 2024-11-14 (updated: 2024-11-18)

Comments: Submitted to Journal of Computer Assisted Learning; updated table refs

Journal: J Comput Assist Learn, 41: e13100 (2025)

DOI: 10.1111/jcal.13100

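💡 One-sentence takeaway

Uses large language models to generate test suites automatically, making autograding for programming courses more efficient.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, autograding, test case generation, programming education, GPT-4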

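📋 Key points

Building test suites for autograders is time-consuming, and inadequate coverage degrades the feedback students receive, which hampers programming instruction.
GPT-4 is given each problem's statement and reference solution and produces a test suite for the autograder.
In the experiments, LLM-generated test suites correctly identify most valid solutions and, for most problems, are at least as comprehensive as the instructor-written suites.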

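📝 Abstract (translated from Chinese)

Automatically grading programming assignments gives students immediate feedback and substantially reduces instructors' manual grading time. However, creating a comprehensive test suite for each programming problem in an autograder can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating more problems, or lead to inadequate test coverage and therefore to misleading feedback on student solutions. This limitation may reduce the benefit students get from timely feedback while learning to program. The paper evaluates how effective large language models (LLMs) are, as part of a larger workflow, at automatically generating test suites for CS1-level programming problems. Each problem's statement and reference solution are given to GPT-4, which produces a test suite that an autograder can use. The approach is evaluated on a sample of 26 problems and more than 25,000 attempted solutions submitted by students in an introductory programming course. The LLM-generated test suites are compared with the instructor-created test suite for each problem. The results show that LLM-generated test suites correctly identify most valid solutions and, for most problems, are at least as comprehensive as the instructor test suites. The LLM-generated test suites also exposed ambiguities in some problem statements, highlighting their potential to improve both autograding and instructional design.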

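🔬 Method details

Problem definition: The paper addresses the cost of building autograder test suites for CS1-level programming courses and the inadequate coverage that often results. Existing practice relies on hand-written test cases, which is slow and prone to missing boundary cases, so students can receive inaccurate feedback that hurts their learning.

Core idea: Use the code understanding and generation abilities of a large language model (LLM): take the problem statement and reference solution as input and generate a high-quality test suite automatically. The aim is to reduce manual effort while improving the coverage and accuracy of the test cases.

Technical framework: The approach has five steps: 1) collect CS1-level programming problems and their reference solutions; 2) give each problem statement and reference solution to an LLM such as GPT-4; 3) have the LLM generate a test suite; 4) run the generated test suite against student submissions in the autograder; 5) compare the LLM-generated test suite with the hand-written one to evaluate its performance.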
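Steps 3 and 4 above can be illustrated with a minimal sketch. It assumes that a test case is a pair of call arguments and an expected return value, and that expected values are obtained by running the reference solution on each proposed input; the source does not confirm either detail. The names `build_suite`, `passes`, `reference_max`, and `buggy_max` are hypothetical, and the hard-coded `llm_inputs` stands in for inputs an LLM would propose.

```python
from typing import Any, Callable, List, Tuple

# Assumed representation: (positional arguments, expected return value).
TestCase = Tuple[tuple, Any]

def build_suite(reference: Callable[..., Any], inputs: List[tuple]) -> List[TestCase]:
    """Pair each candidate input with the reference solution's output."""
    return [(args, reference(*args)) for args in inputs]

def passes(candidate: Callable[..., Any], suite: List[TestCase]) -> bool:
    """A submission passes only if it matches the expected value on every case."""
    for args, expected in suite:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash on any test input counts as a failure.
            return False
    return True

# Hypothetical CS1 problem: return the largest element of a non-empty list.
def reference_max(xs):
    return max(xs)

def buggy_max(xs):
    # Bug: starts from 0, so it is wrong for all-negative lists.
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best

# Stand-in for LLM-proposed inputs; the last one targets a boundary case.
llm_inputs = [([1, 2, 3],), ([5],), ([-4, -2, -9],)]
suite = build_suite(reference_max, llm_inputs)

print(passes(reference_max, suite))
print(passes(buggy_max, suite))
```

The all-negative input is the only case that separates the two implementations here, which is the kind of boundary case a thin hand-written suite might omit.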

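Key innovation: The main novelty is having an LLM generate the test suite, which greatly reduces the manual effort of writing test cases. The abstract describes the LLM as one part of a larger workflow, so the process is not claimed to be fully hands-off. Because the LLM reads both the problem statement and the reference solution, it can produce targeted test cases with broad coverage, which is intended to make automated assessment more accurate.

Key design: The paper uses GPT-4 as a black box, with no fine-tuning or special parameter settings. The key question is how to present the problem statement and reference solution to the LLM and steer it toward suitable test cases. The paper does not spell out its prompt engineering in detail, and this may be a key factor in the final results.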
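Since the prompt itself is not given, the following is only a sketch of what presenting a problem to the model and reading back its answer might look like. The prompt wording, the JSON reply format, and the names `build_prompt` and `parse_inputs` are all assumptions made for illustration. No model is called; `sample_reply` is a hand-written stand-in for a reply.

```python
import json
from typing import List

def build_prompt(statement: str, reference_solution: str, n_cases: int = 10) -> str:
    """Assemble a hypothetical prompt; the wording is illustrative only."""
    return (
        "You are writing tests for an introductory programming autograder.\n"
        f"Problem statement:\n{statement}\n\n"
        f"Reference solution:\n{reference_solution}\n\n"
        f"Propose {n_cases} test inputs, including edge cases. "
        "Reply with only a JSON array; each element is the list of "
        "arguments for one call."
    )

def parse_inputs(reply: str) -> List[tuple]:
    """Turn a JSON reply into argument tuples, rejecting malformed replies."""
    data = json.loads(reply)
    if not isinstance(data, list) or not all(isinstance(c, list) for c in data):
        raise ValueError("expected a JSON array of argument lists")
    return [tuple(case) for case in data]

prompt = build_prompt(
    "Return the largest element of a non-empty list.",
    "def largest(xs):\n    return max(xs)",
    n_cases=3,
)

# Hand-written stand-in for a model reply (assumed format).
sample_reply = "[[[1, 2, 3]], [[5]], [[-4, -2, -9]]]"
inputs = parse_inputs(sample_reply)
print(inputs)
```

Validating the reply before use matters in a sketch like this because model output is free text, and a malformed reply should fail loudly instead of silently producing an empty or partial suite.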

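📊 Experimental highlights

The LLM-generated test suites correctly identify most valid solutions and, for most problems, are at least as comprehensive as the instructor-created suites. They also exposed ambiguities in some problem statements, which suggests they can help improve instructional design.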
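One way to make such a comparison concrete is to tally, per submission, whether each suite accepts it. The sketch below is an assumption about how that bookkeeping could be done, not the paper's evaluation code; `compare_verdicts` and the category names are hypothetical, and the `verdicts` list is toy data, not results from the study.

```python
from collections import Counter
from typing import Iterable, Tuple

def compare_verdicts(pairs: Iterable[Tuple[bool, bool]]) -> Counter:
    """Tally (instructor_pass, llm_pass) verdict pairs over submissions."""
    labels = {
        (True, True): "both_accept",
        (False, False): "both_reject",
        (True, False): "only_instructor_accepts",
        (False, True): "only_llm_accepts",
    }
    return Counter(labels[p] for p in pairs)

# Toy data: one (instructor_pass, llm_pass) pair per submission.
verdicts = [(True, True), (True, True), (False, False), (True, False), (False, True)]
tally = compare_verdicts(verdicts)

# Fraction of submissions on which the two suites agree.
agreement = (tally["both_accept"] + tally["both_reject"]) / len(verdicts)
print(tally, agreement)
```

The two disagreement buckets are the ones worth inspecting by hand. A submission only the instructor suite accepts may mean the LLM suite caught a real bug, or that it tested behavior the problem statement leaves ambiguous. A submission only the LLM suite accepts points to a case the instructor suite covers and the LLM suite misses.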

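🎯 Application scenarios

The results are relevant to programming education platforms and online practice systems, where they could help instructors build high-quality autograders quickly, reduce manual grading, and give students faster feedback. The approach could also be applied in software testing more broadly to generate test cases automatically.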

📄 Abstract (original)

Automatically graded programming assignments provide instant feedback to students and significantly reduce manual grading time for instructors. However, creating comprehensive suites of test cases for programming problems within automatic graders can be time-consuming and complex. The effort needed to define test suites may deter some instructors from creating additional problems or lead to inadequate test coverage, potentially resulting in misleading feedback on student solutions. Such limitations may reduce student access to the well-documented benefits of timely feedback when learning programming. In this work, we evaluate the effectiveness of using Large Language Models (LLMs), as part of a larger workflow, to automatically generate test suites for CS1-level programming problems. Each problem's statement and reference solution are provided to GPT-4 to produce a test suite that can be used by an autograder. We evaluate our proposed approach using a sample of 26 problems, and more than 25,000 attempted solutions to those problems, submitted by students in an introductory programming course. We compare the performance of the LLM-generated test suites against the instructor-created test suites for each problem. Our findings reveal that LLM-generated test suites can correctly identify most valid solutions, and for most problems are at least as comprehensive as the instructor test suites. Additionally, the LLM-generated test suites exposed ambiguities in some problem statements, underscoring their potential to improve both autograding and instructional design.
