DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios
Authors: Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo
Categories: cs.LG, cs.AI, cs.CL
Published: 2026-06-05
Comments: Accepted by KDD 2026
💡 One-Sentence Takeaway
Proposes the DEFINED framework to address creativity assessment in debate scenarios.
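🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: creativity assessment, debate analysis, data augmentation, hierarchical scoring, machine learning, natural language processing, educational technology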
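📋 Key Points
- Existing automated scoring methods perform poorly in complex debate settings, so assessment still relies on costly human evaluation.
- The paper proposes the DEFINED framework, which enables fine-grained assessment of debate creativity through a hierarchical eight-dimensional metric system.
- Experiments show that DEFINED outperforms prompt-based large language model evaluators and existing debate scoring methods in scoring accuracy and stability.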
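📝 Abstract (Summary)
Human creativity has become a critical competency in the era of large language models, but assessing it in complex, open-ended environments remains difficult, hindered by a reliance on standardized simple tasks and a scarcity of fine-grained expert data. Debate is an ecologically valid assessment context that reflects multiple dimensions of creativity. The paper therefore proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. The framework operationalizes debate creativity through a hierarchical eight-dimensional metric system and uses a pre-trained autoregressive language model with a hierarchical scoring head, supporting both fine-grained and coarse-grained evaluation. Using statements and expert scores obtained from authentic debate competitions, combined with a constrained data augmentation strategy, DEFINED learns robustly from limited fine-grained supervision and shows better accuracy and stability than existing methods in evaluation.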
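🔬 Method Details
Problem definition: The paper addresses fine-grained creativity assessment in debate scenarios. Existing methods rely on simple tasks, do not transfer to complex settings, and lack sufficient expert-annotated data.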
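Core idea: DEFINED builds a hierarchical eight-dimensional metric system and pairs it with a pre-trained autoregressive language model to assess creativity in debate, addressing the limitations of conventional methods.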
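Technical framework: DEFINED comprises four main modules: data collection, construction of the hierarchical metric system, model training, and evaluation. Data are first collected from authentic debate competitions; an eight-dimensional scoring rubric is then constructed; the model is trained with a mixed-granularity training strategy; and finally its scoring is evaluated.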
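Key innovation: The main contribution is the combination of a hierarchical eight-dimensional scoring system with a mixed-granularity training strategy, which lets the model learn robustly from limited fine-grained supervision and improves the accuracy and stability of the assessment.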
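The mixed-granularity idea can be illustrated with a minimal sketch, under stated assumptions. The summary does not give the loss, so everything here is hypothetical: the function name `mixed_granularity_loss`, the squared-error form, the `fine_weight` parameter, and the choice to derive the coarse prediction as the mean of the eight fine-grained predictions. The only point it shows is that a sample with a coarse label alone still contributes a training signal, while a sample with per-dimension labels contributes an extra term.

```python
from typing import Optional, Sequence


def mixed_granularity_loss(
    pred_fine: Sequence[float],
    target_coarse: float,
    target_fine: Optional[Sequence[float]] = None,
    fine_weight: float = 1.0,
) -> float:
    """Squared-error loss for one sample (illustrative, not the paper's loss).

    The coarse term is always computed; the fine-grained term is added only
    when per-dimension labels are available for this sample.
    """
    # Assumption: the coarse prediction is the mean of the fine-grained ones.
    pred_coarse = sum(pred_fine) / len(pred_fine)
    loss = (pred_coarse - target_coarse) ** 2

    if target_fine is not None:
        if len(target_fine) != len(pred_fine):
            raise ValueError("target_fine and pred_fine must have equal length")
        # Mean squared error over the individual dimensions.
        fine_term = sum(
            (p - t) ** 2 for p, t in zip(pred_fine, target_fine)
        ) / len(pred_fine)
        loss += fine_weight * fine_term

    return loss


# A coarse-only sample and a fully annotated sample go through the same function.
coarse_only = mixed_granularity_loss([2.0] * 8, target_coarse=3.0)
fully_labeled = mixed_granularity_loss(
    [3.0] * 8, target_coarse=3.0, target_fine=[4.0] * 8
)
```

In a real training loop the same branching would typically be applied per batch element with a mask, so that abundant coarse labels and scarce fine-grained labels can be mixed in one batch.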
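Key design: The model uses a hierarchical scoring head to support evaluation at different granularities, and a constrained data augmentation strategy mitigates the elite bias in the original competition data, with the aim of helping the model perform well on more diverse data.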
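A hierarchical scoring head of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the authors' architecture: the summary does not name the eight dimensions or say how they are grouped, so `GROUPS`, the group names, the linear per-dimension scorer, and mean aggregation are all placeholders. The `features` list stands in for a hidden representation taken from the pre-trained language model.

```python
from typing import Dict, List, Sequence

# Hypothetical grouping of the eight dimensions into two coarse categories.
# The indices and names are placeholders; the source does not specify them.
GROUPS: Dict[str, List[int]] = {
    "group_a": [0, 1, 2, 3],
    "group_b": [4, 5, 6, 7],
}


def hierarchical_scores(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    groups: Dict[str, List[int]] = GROUPS,
) -> dict:
    """Compute fine-grained, group-level, and overall scores from one feature vector."""
    if len(weights) != len(biases):
        raise ValueError("need one weight row and one bias per dimension")

    # Fine-grained level: one linear scorer per dimension.
    fine = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    # Coarse level: average the dimensions that belong to each group.
    coarse = {
        name: sum(fine[i] for i in idx) / len(idx)
        for name, idx in groups.items()
    }
    # Top level: average over all dimensions.
    overall = sum(fine) / len(fine)
    return {"fine": fine, "coarse": coarse, "overall": overall}


# Toy usage with a 2-dimensional feature vector and hand-picked weights.
toy = hierarchical_scores(
    features=[1.0, 2.0],
    weights=[[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4,
    biases=[0.0] * 8,
)
```

Deriving the coarse scores from the fine-grained ones keeps the two levels consistent by construction; a learned aggregation would be an equally plausible design.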
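📊 Experimental Highlights
In the experiments, the DEFINED scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods. The abstract does not report specific figures.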
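🎯 Application Scenarios
Potential applications include educational assessment, debate training, and AI-assisted creativity evaluation. An efficient creativity assessment tool could help educators and trainers better understand and develop students' creativity and support more personalized education and training. The framework might also be extended to other domains that require creativity assessment, such as artistic creation and innovation management.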
📄 Abstract (Original)
Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.