DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

📄 arXiv: 2505.08744v1

Authors: Xiaoyang Chen, Xinan Dai, Yu Du, Qian Feng, Naixu Guo, Tingshuo Gu, Yuting Gao, Yingyi Gao, Xudong Han, Xiang Jiang, Yilin Jin, Hongyi Lin, Shisheng Lin, Xiangnan Li, Yuante Li, Yixing Li, Zhentao Lai, Zilu Ma, Yingrong Peng, Jiacheng Qian, Hao-Yu Sun, Jianbo Sun, Zirui Wang, Siwei Wu, Zian Wang, Bin Xu, Jianghao Xu, Yiyang Yu, Zichuan Yang, Hongji Zha, Ruichong Zhang

Category: cs.AI

Published: 2025-05-13

Comments: 14 pages, 4 figures


💡 One-Sentence Takeaway

Introduces DeepMath-Creative, a benchmark for evaluating the mathematical creativity of large language models.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: mathematical creativity, large language models, evaluation criteria, DeepMath-Creative, open-source initiative, problem-solving ability, educational technology

📋 Key Points

  1. Existing mathematical LLMs focus mainly on reasoning ability; systematic evaluation of creativity, and datasets to support it, are lacking.
  2. This paper proposes evaluation criteria for mathematical creativity and constructs the DeepMath-Creative benchmark dataset, which spans multiple mathematical domains.
  3. Experiments show that while LLMs do reasonably well on basic tasks, they perform poorly on complex problems and fail to produce genuinely creative solutions.

🔬 Method Details

Problem definition: This work targets the gap in creativity evaluation for current mathematical LLMs: existing efforts concentrate on reasoning ability and provide no systematic assessment of creative problem-solving.

Core idea: The paper fills this gap by proposing evaluation criteria for mathematical creativity and constructing the DeepMath-Creative dataset, enabling a comprehensive assessment of how LLMs perform in creative problem-solving.

Technical framework: The overall pipeline consists of three modules: dataset construction, evaluation-criteria design, and model evaluation. The dataset covers constructive problems in algebra, geometry, analysis, and other domains, while the evaluation criteria center on the completeness of a solution's core components.
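
To make the three-module pipeline concrete, here is a minimal Python sketch of what a benchmark item and evaluation loop could look like. The schema, the field names (`domain`, `difficulty`, `core_components`), and the `evaluate_model` helper are illustrative assumptions, not the paper's actual data format or code.

```python
# Minimal sketch of a DeepMath-Creative-style item and evaluation loop.
# All field and function names are illustrative assumptions, not the paper's schema.
from collections.abc import Callable
from dataclasses import dataclass, field

@dataclass
class CreativeProblem:
    problem_id: str
    domain: str        # e.g. "algebra", "geometry", "analysis"
    difficulty: str    # e.g. "undergraduate", "advanced", "open"
    statement: str     # the constructive problem text
    core_components: list[str] = field(default_factory=list)  # rubric items

def evaluate_model(model: Callable[[str], str],
                   problems: list[CreativeProblem],
                   grade: Callable[[str, list[str]], bool]) -> float:
    """Run a model over the benchmark and return overall accuracy."""
    correct = sum(grade(model(p.statement), p.core_components) for p in problems)
    return correct / len(problems)
```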

Key innovation: The central contribution is the set of evaluation criteria for mathematical creativity together with the high-quality DeepMath-Creative benchmark, which the authors position as the first of its kind in the literature.

Key design: The experiments adopt lenient scoring that credits the core components of a solution while ignoring minor logical flaws and redundant explanation, so as to isolate the models' creative performance.
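
The paper's actual grading protocol (likely human or model-assisted judging) is not reproduced in this digest; the keyword-matching `lenient_grade` below is only a hypothetical stand-in that illustrates the leniency principle: pass if every core component is present, forgive everything else.

```python
def lenient_grade(answer: str, core_components: list[str]) -> bool:
    """Illustrative lenient grader (assumed, not the paper's method): the
    answer passes if every core solution component appears, regardless of
    small logical gaps, incomplete justifications, or redundant explanation
    elsewhere in the response."""
    text = answer.lower()
    return all(component.lower() in text for component in core_components)
```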

📊 Experimental Highlights

The best-performing model, O3 Mini, achieves only 70% accuracy on basic undergraduate-level constructive tasks, and its performance declines sharply on more complex problems, where it fails to offer substantive solution strategies. Creative problem-solving in current LLMs thus still has considerable room for improvement.
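
Since the reported decline is stratified by difficulty (around 70% on basic undergraduate tasks, much lower on complex and open problems), per-difficulty accuracy is the natural way to surface it. A small, assumed aggregation helper, purely for illustration:

```python
from collections import defaultdict

def accuracy_by_difficulty(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (difficulty, passed) pairs into per-difficulty accuracy,
    e.g. to expose a drop from ~0.70 on 'undergraduate' items to near zero
    on 'open' problems. Illustrative only; not the paper's code."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for difficulty, passed in results:
        totals[difficulty] += 1
        passes[difficulty] += passed
    return {d: passes[d] / totals[d] for d in totals}
```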

🎯 Application Scenarios

Potential application areas include education, intelligent tutoring systems, and mathematical problem generation. Raising the mathematical creativity of LLMs could give students more challenging and thought-provoking learning material, advance personalized learning, and may have a far-reaching impact on educational technology.

📄 Abstract (Original)

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose an evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria -- emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations -- the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.