An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics

📄 arXiv: 2510.19866v1

Author: Xincheng Liu

Categories: cs.CL, cs.AI

Published: 2025-10-22

Comments: 20 pages, 6 tables

DOI: 10.35542/osf.io/r3xkt_v1


💡 One-Sentence Takeaway

Evaluates the pedagogical effectiveness of AI-generated high-school physics lesson plans across different models and prompt frameworks.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: AI-generated lesson plans, pedagogical effectiveness, large language models, structured prompt frameworks, high-school physics, educational technology, automated metrics

📋 Key Points

  1. Existing AI-generated lesson plans fall short in pedagogical soundness and usability, particularly in readability and factual accuracy.
  2. This study compares different large language models and prompt frameworks to identify configurations that optimize lesson-plan generation and improve instructional quality.
  3. Experiments show that the DeepSeek model produced the most readable lesson plans, while the RACE framework stood out on factual accuracy and curriculum-standards alignment.

📝 Abstract (Summary)

This study evaluates the pedagogical soundness and usability of lesson plans generated by five leading large language models (ChatGPT, Claude, Gemini, DeepSeek, and Grok). Using three structured prompt frameworks (TAG, RACE, and COSTAR), fifteen lesson plans were generated for the high-school physics topic "The Electromagnetic Spectrum." The plans were analyzed on four automated metrics: readability, factual accuracy, curriculum-standards alignment, and the cognitive demand of learning objectives. Results show that model choice exerts the strongest influence on linguistic readability, while the prompt-framework structure more strongly affects factual accuracy and pedagogical completeness. The most effective configuration combines a readability-optimized model with the RACE framework and an explicit checklist of physics concepts and curriculum standards.

🔬 Method Details

Problem definition: The study addresses the shortcomings of AI-generated lesson plans in pedagogical soundness and usability, particularly in readability and factual accuracy. Existing approaches do not adequately account for how model choice and prompt framework affect lesson-plan quality.

Core idea: Compare different large language models and structured prompt frameworks, evaluate their impact on the generated lesson plans, and identify the optimal configuration for improving instructional effectiveness.

Technical framework: Five large language models and three prompt frameworks were used to generate fifteen lesson plans on "The Electromagnetic Spectrum." The plans were analyzed on four automated metrics: readability, factual accuracy, curriculum-standards alignment, and the cognitive demand of learning objectives.
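The readability metric reported in the results is the Flesch-Kincaid Grade Level (FKGL). As a rough illustration of how such an automated check can be computed (the paper's exact tooling is not specified), a minimal sketch using the standard FKGL formula and a naive syllable counter:

```python
import re

def count_syllables(word: str) -> int:
    """Naive heuristic: count vowel groups, drop one trailing silent 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Dedicated readability tools use dictionary-based syllable counts; this vowel-group heuristic is only indicative.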

Key innovation: The study systematically compares how model and prompt-framework choices affect lesson-plan quality, with in-depth analysis of readability and pedagogical completeness, filling a gap in existing research.

Key design: Five language models (including DeepSeek and Claude) and three prompt frameworks (TAG, RACE, and COSTAR) were selected, and the quality of the generated lesson plans was evaluated through automated metrics.
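For concreteness, a RACE-structured (Role, Audience, Context, Execution) prompt could be assembled as below. The field wording is illustrative only; the paper's exact templates are not reproduced here.

```python
def race_prompt(role: str, audience: str, context: str, execution: str) -> str:
    """Assemble a RACE-structured prompt from its four labeled fields."""
    return (
        f"Role: {role}\n"
        f"Audience: {audience}\n"
        f"Context: {context}\n"
        f"Execution: {execution}"
    )

prompt = race_prompt(
    role="You are an experienced high-school physics teacher.",
    audience="Grade 11 students studying electromagnetism.",
    context="Topic: The Electromagnetic Spectrum, aligned with NGSS standards.",
    execution="Write a complete 50-minute lesson plan with objectives and activities.",
)
```

TAG and COSTAR prompts differ only in which labeled fields they include, so the same pattern extends to all three frameworks.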

📊 Experimental Highlights

DeepSeek produced the most readable lesson plans (FKGL = 8.64), while Claude generated the densest language (FKGL = 19.89). The RACE framework performed best on factual accuracy, with the lowest hallucination index and the highest rate of curriculum-standards alignment. Overall, learning objectives clustered at the Remember and Understand tiers of Bloom's taxonomy, with few higher-order verbs.
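The Bloom's-taxonomy analysis hinges on mapping each learning objective's verb to a tier. A minimal keyword-based sketch, with illustrative verb lists (the paper's classification method may differ):

```python
# Tier keywords follow common Bloom's-taxonomy verb lists; the mapping is illustrative.
BLOOM_TIERS = {
    "Remember":   {"define", "list", "identify", "recall", "state"},
    "Understand": {"explain", "describe", "summarize", "classify", "compare"},
    "Apply":      {"calculate", "demonstrate", "solve", "use"},
    "Analyze":    {"analyze", "differentiate", "examine", "relate"},
    "Evaluate":   {"justify", "critique", "assess", "argue"},
    "Create":     {"design", "construct", "develop", "formulate"},
}

def bloom_tier(objective: str) -> str:
    """Return the Bloom tier of the first recognized verb in a learning objective."""
    for word in objective.lower().replace(",", " ").split():
        for tier, verbs in BLOOM_TIERS.items():
            if word in verbs:
                return tier
    return "Unclassified"
```

Under this scheme, "identify the regions of the electromagnetic spectrum" lands at Remember, matching the lower-order clustering the study reports.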

🎯 Application Scenarios

Potential applications include educational technology, online learning platforms, and teacher training. Optimizing AI-generated lesson plans can help teachers design courses more effectively and improve students' learning experience and outcomes. Looking ahead, this line of work could broaden AI's use in education and advance personalized, intelligent learning.

📄 Abstract (Original)

This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with DeepSeek producing the most readable teaching plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89). The prompt framework structure most strongly affected the factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index and the highest incidental alignment with NGSS curriculum standards. Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom's taxonomy. There were limited higher-order verbs in the learning objectives extracted. Overall, the findings suggest that readability is significantly governed by model design, while instructional reliability and curricular alignment depend more on the prompt framework. The most effective configuration for lesson plans identified in the results was to combine a readability-optimized model with the RACE framework and an explicit checklist of physics concepts, curriculum standards, and higher-order objectives.