Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

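Authors: Tao Liu, Ye Lu, Ruohua Zhang, Siyu Song, Wentao Liu, Aimin Zhou, Hao Hao

Category: cs.LG

Published: 2026-06-04

💡 One-Sentence Summary

Proposes the Elmes* framework to address shortcomings in how LLMs are evaluated in educational scenarios.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, educational evaluation, automated scoring, multi-agent systems, fine-grained evaluation, long-tail scenarios, self-evolving module

📋 Key Points

Existing evaluation methods focus mainly on what a model knows and do not comprehensively assess its teaching ability; they hold up especially poorly in long-tail educational scenarios.
Elmes* combines a multi-agent engine with a self-evolving module to automatically construct and refine fine-grained educational rubrics, making evaluation more adaptable and accurate.
Experiments show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration, and the education-specialized model InnoSpark achieves the best human-evaluated average score.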

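📝 Abstract (Translated)

Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks mainly emphasize domain-general correctness or rely on manually designed rubrics, which scale poorly to long-tail educational scenarios. This paper proposes Elmes, an end-to-end framework for constructing, refining, and applying fine-grained, scenario-specific rubrics. Elmes combines a declarative multi-agent engine for interactions among teacher, student, and judge with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data starting from expert-defined pedagogical dimensions. Using Elmes*, the authors build Edu-330, which covers 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators.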
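The Edu-330 scenario count is consistent with a full cross of the three axes (11 × 3 × 10 = 330). The sketch below enumerates such a grid, assuming that every combination of subject, grade band, and task type yields exactly one scenario. The abstract does not confirm that construction, and the axis labels are placeholders rather than the benchmark's actual names.

```python
from itertools import product

# Placeholder axis labels (hypothetical); only the axis sizes come from the abstract.
subjects = [f"subject_{i}" for i in range(1, 12)]       # 11 subjects
grade_bands = [f"grade_band_{i}" for i in range(1, 4)]  # 3 grade bands
task_types = [f"task_type_{i}" for i in range(1, 11)]   # 10 task types

# Assumption: one scenario per (subject, grade band, task type) combination.
scenarios = list(product(subjects, grade_bands, task_types))

print(len(scenarios))  # 11 * 3 * 10 = 330
```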

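🔬 Method Details

Problem definition: The paper targets gaps in how LLMs are evaluated in educational scenarios. Existing approaches do not comprehensively assess teaching ability, and they scale poorly to long-tail scenarios.

Core idea: Elmes* pairs multi-agent interaction with a self-evolving module to automatically construct and refine fine-grained rubrics that adapt to diverse educational scenarios, with the aim of making evaluation more valid and accurate.

Technical framework: The Elmes* architecture has two main modules: a multi-agent engine that drives interactions among teacher, student, and judge, and SceneGen, a self-evolving module that refines the evaluation criteria and test data.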
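To make the teacher, student, and judge roles concrete, here is a minimal sketch of a declaratively described scenario run through a fixed interaction loop. It is not the Elmes engine: `Rubric`, `Scenario`, `run_scenario`, and the stub agents are hypothetical names, and the stubs stand in for what would be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Rubric:
    # First-level dimension -> list of second-level indicators (hypothetical layout).
    dimensions: Dict[str, List[str]]

@dataclass
class Scenario:
    # Declarative description of one evaluation scenario (hypothetical fields).
    name: str
    opening_prompt: str
    rubric: Rubric
    max_turns: int = 2

Speaker = Callable[[List[str]], str]
Judge = Callable[[List[str], Rubric], Dict[str, float]]

def run_scenario(scenario: Scenario, teacher: Speaker, student: Speaker,
                 judge: Judge) -> Tuple[List[str], Dict[str, float]]:
    """Alternate teacher and student turns, then let the judge score the transcript."""
    transcript = [f"student: {scenario.opening_prompt}"]
    for _ in range(scenario.max_turns):
        transcript.append("teacher: " + teacher(transcript))
        transcript.append("student: " + student(transcript))
    return transcript, judge(transcript, scenario.rubric)

# Stub agents; a real setup would call LLMs here.
def stub_teacher(transcript: List[str]) -> str:
    return "What do you already know about this step?"

def stub_student(transcript: List[str]) -> str:
    return "I am not sure, could you give me a hint?"

def stub_judge(transcript: List[str], rubric: Rubric) -> Dict[str, float]:
    # Placeholder scoring: every second-level indicator gets 0.0.
    return {ind: 0.0 for inds in rubric.dimensions.values() for ind in inds}

scenario = Scenario(
    name="fraction_addition_tutoring",
    opening_prompt="Why is 1/2 + 1/3 not 2/5?",
    rubric=Rubric({"scaffolding": ["asks guiding questions",
                                   "avoids giving the answer outright"]}),
)
transcript, scores = run_scenario(scenario, stub_teacher, stub_student, stub_judge)
print(len(transcript), sorted(scores))
```

Keeping the scenario as plain data separates what is evaluated from how the agents are implemented, which is one plausible reading of "declarative" here.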

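Key innovation: The core contribution of Elmes* is automating the construction and refinement of fine-grained rubrics, which makes evaluation more adaptable and accurate than traditional manually designed rubrics.

Key design: Elmes* uses expert-defined pedagogical dimensions to guide rubric refinement and expert-scored few-shot anchoring to improve human-LLM alignment. The effects of reasoning enforcement and greedy decoding, by contrast, depend on the specific model.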
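Few-shot anchoring can be pictured as showing the judge a handful of expert-scored examples before the response it must score. The sketch below builds such a prompt as plain text. The function name, prompt wording, and example anchors are all assumptions for illustration, not the prompt format used in the paper.

```python
from typing import List, Tuple

def build_anchored_prompt(indicator: str,
                          anchors: List[Tuple[str, int]],
                          response: str) -> str:
    """Assemble a judge prompt that lists expert-scored examples before the target."""
    lines = [f"Score the response on the indicator: {indicator}", ""]
    for i, (example, score) in enumerate(anchors, start=1):
        lines += [f"Example {i}: {example}", f"Expert score: {score}", ""]
    lines += [f"Response to score: {response}", "Score:"]
    return "\n".join(lines)

# Made-up anchors for illustration only (not taken from the paper).
anchors = [
    ("The answer is 5/6. Next question.", 1),
    ("What would happen if both fractions had the same denominator?", 5),
]
prompt = build_anchored_prompt(
    indicator="asks guiding questions",
    anchors=anchors,
    response="Can you think of a number that both 2 and 3 divide into?",
)
print(prompt)
```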

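📊 Experimental Highlights

Experiments on Edu-330, which was built with Elmes*, show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration. The education-specialized model InnoSpark achieves the best average score under human evaluation, and LLM judges show much lower scoring variance than human judges.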
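The two judge properties mentioned here, ranking agreement and scoring variance, can be checked with a few lines of standard-library Python. The sketch below uses made-up toy scores, not results from the paper, and the helper names `ranking` and `mean_variance` are hypothetical.

```python
from statistics import mean, pvariance
from typing import Dict, List

def ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Order models from highest to lowest mean score."""
    return sorted(scores, key=lambda m: mean(scores[m]), reverse=True)

def mean_variance(scores: Dict[str, List[float]]) -> float:
    """Average, over models, of the population variance of each model's scores."""
    return mean(pvariance(v) for v in scores.values())

# Toy scores for illustration only (NOT results from the paper).
# Each list holds the scores one judge group gave a model across repeated ratings.
human = {"model_a": [4, 2, 5], "model_b": [3, 1, 4], "model_c": [2, 1, 2]}
llm = {"model_a": [4, 4, 3], "model_b": [3, 3, 2], "model_c": [2, 2, 1]}

same_ranking = ranking(human) == ranking(llm)
lower_variance = mean_variance(llm) < mean_variance(human)
print(same_ranking, lower_variance)
```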

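🎯 Application Scenarios

Elmes* has broad potential in educational evaluation: it can give educators more precise assessment tools and help them better understand and improve student learning outcomes. The framework's design may also extend to model evaluation in other domains.

📄 Abstract (Original)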

Evaluating large language models (LLMs) for education requires measuring how models teach, not only what they know. Existing benchmarks emphasize domain-general correctness or depend on manually designed rubrics that scale poorly to long-tail pedagogical scenarios. We introduce Elmes, an end-to-end framework for constructing, refining, and applying fine-grained scenario-specific rubrics. Elmes combines a declarative multi-agent engine for teacher-student-judge interactions with SceneGen, a self-evolving module that co-optimizes evaluation criteria and test data from expert-defined pedagogical dimensions. Using Elmes, we build Edu-330, covering 330 scenarios across 11 subjects, 3 grade bands, and 10 task types, with over 1,000 second-level indicators. Experiments on Edu-330 and four expert-authored gold-standard scenarios show that educational capability is multidimensional: top-tier LLMs differ mainly in creativity and values integration, knowledge-strong models may fail at Socratic scaffolding, and the education-specialized InnoSpark achieves the best human-evaluated average score. LLM judges preserve human-comparable rankings with much lower scoring variance, but exhibit judge-specific biases such as self-preference. Ablations show that expert-scored few-shot anchoring improves human-LLM alignment, while reasoning enforcement and greedy decoding are model-dependent. Elmes thus provides scalable diagnostic infrastructure for pedagogically grounded LLM evaluation.
