Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

Authors: Binxin Gao, Jingjun Han

Categories: cs.LG, cs.AI, cs.CL

Published: 2025-10-14 (updated: 2025-10-17)

Comments: Our benchmark dataset is available at https://huggingface.co/datasets/binxingao/extrem-bench

💡 One-sentence summary

Introduces the ExtremBench benchmark to evaluate how well LLMs solve extremal problems.

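🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, extremal problems, mathematical reasoning, benchmark evaluation, optimization reasoning, dataset construction, intelligent systems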

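📋 Key points

Existing mathematical benchmarks do not specifically measure how well LLMs solve extremal problems, which leaves an evaluation gap.
The paper introduces ExtremBench, a benchmark dataset for systematically evaluating LLM reasoning on extremum-finding problems.
Experiments show that a model's extremal-solving performance does not always align with its scores on conventional math benchmarks, which points to a blind spot in current evaluation practice.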

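📝 Abstract (translated from the Chinese)

Test-time scaling has given large language models (LLMs) remarkable reasoning capabilities, particularly in mathematics, through intermediate chain-of-thought reasoning carried out before the final answer is generated. However, the specific sources and mechanisms behind these capabilities are still not well understood. Optimization reasoning, that is, finding extrema under constraints, is a fundamental abstraction underlying key applications such as planning, control, resource allocation, and prompt search. To evaluate this capability systematically, the paper introduces ExtremBench, a benchmark dataset for mathematical extremal problems, curated from inequality exercises used for the Chinese Mathematical Olympiad and transformed into 93 standardized extremum-finding problems. The authors evaluate a range of state-of-the-art open-source models and find that LLMs' extremal-solving ability does not always align with results on current math benchmarks such as AIME25 and MATH-500: some models are strong at general mathematical reasoning but weak at extremal problems, and vice versa. This discrepancy highlights a key gap in current evaluation practice and suggests that existing benchmarks may not capture the full spectrum of mathematical reasoning abilities.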

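🔬 Method details

Problem definition: The paper addresses the lack of evaluation of LLMs on extremal problems; existing benchmarks do not capture this part of their reasoning ability.

Core idea: Introduce the ExtremBench benchmark dataset to evaluate LLM performance on mathematical extremal problems systematically and close this evaluation gap.

Technical framework: The overall pipeline has three main modules: dataset construction, model selection, and evaluation. The dataset consists of 93 standardized extremal problems, and the evaluation covers several open-source model families.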
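The evaluation module can be pictured with a small answer-checking sketch. This is only an illustration under stated assumptions: this summary does not describe how the authors grade answers, so the `\boxed{...}` answer convention, the helper names (`extract_boxed`, `to_number`, `is_correct`), and the numeric tolerance are all hypothetical choices, not the paper's method.

```python
import math
import re
from fractions import Fraction


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # braces never balanced


def to_number(answer):
    """Parse an integer, decimal, 'a/b', or '\\frac{a}{b}' into a float.

    Returns None for anything else (for example radicals), which the
    caller handles with a plain string comparison.
    """
    s = answer.replace(" ", "")
    m = re.fullmatch(r"(-?)\\d?frac\{(\d+)\}\{(\d+)\}", s)
    if m:
        s = f"{m.group(1)}{m.group(2)}/{m.group(3)}"
    try:
        return float(Fraction(s))
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(model_output, reference, rel_tol=1e-6):
    """Assumed grading rule: compare the boxed answer with the reference."""
    boxed = extract_boxed(model_output)
    if boxed is None:
        return False
    got, want = to_number(boxed), to_number(reference)
    if got is None or want is None:
        # Fall back to an exact match on whitespace-stripped strings.
        return boxed.replace(" ", "") == reference.replace(" ", "")
    return math.isclose(got, want, rel_tol=rel_tol, abs_tol=1e-9)
```

Accuracy on a benchmark would then be the fraction of problems for which `is_correct` returns `True`. A real grader would likely need symbolic equivalence checking for answers such as radicals, which this sketch only compares as strings.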

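Key innovation: The main contribution is ExtremBench itself, a benchmark dedicated to extremal problems. It offers a more targeted evaluation than existing math benchmarks and so exposes an aspect of LLM reasoning that those benchmarks can miss.

Key design: For dataset construction, the authors start from inequality exercises used for the Chinese Mathematical Olympiad, which is intended to keep the problems varied and challenging. For evaluation, they compare several state-of-the-art open-source model families, including Qwen3, GPT-OSS, and DeepSeek.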
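To make the idea of turning an inequality exercise into an extremum-finding problem concrete, here is a standard textbook example. It is an assumption for illustration only: it is not taken from ExtremBench, and the authors' actual transformation procedure is not described in this summary.

```latex
\begin{align*}
&\textbf{Inequality form: } \text{for } x, y > 0 \text{ with } x + y = 1,
  \text{ prove } \frac{1}{x} + \frac{1}{y} \ge 4. \\
&\textbf{Extremal form: } \text{for } x, y > 0 \text{ with } x + y = 1,
  \text{ find } \min\left(\frac{1}{x} + \frac{1}{y}\right). \\
&\textbf{Solution: } \frac{1}{x} + \frac{1}{y} = \frac{x + y}{xy} = \frac{1}{xy}
  \ge \frac{1}{\left(\frac{x + y}{2}\right)^{2}} = 4,
  \text{ with equality iff } x = y = \tfrac{1}{2}.
\end{align*}
```

The extremal form has a single final answer (here 4), which can be checked automatically, whereas the proof form would require grading a full argument.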

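📊 Experimental highlights

The experiments show that LLMs' extremal-solving performance does not always align with their results on conventional math benchmarks such as AIME25 and MATH-500. Some models are strong at general mathematical reasoning but weak at extremal problems, and vice versa, which indicates that current evaluation practice is incomplete.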
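One way a reader might quantify how well two benchmarks agree is a rank correlation over per-model scores. The sketch below is an assumption, not the authors' analysis: the summary does not say which statistic, if any, the paper uses, and the function names are hypothetical. It implements Spearman's rank correlation with average ranks for ties; the numbers in the demo are toy values, not results from the paper.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Toy numbers only: four hypothetical models scored on two benchmarks.
    general_math = [1, 2, 3, 4]
    extremal = [2, 1, 4, 3]
    print(spearman(general_math, extremal))
```

A value near 1 would mean the two benchmarks rank models almost identically; a noticeably lower value would be consistent with the misalignment described above.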

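🎯 Application scenarios

Potential application areas include education, automated decision-making, and resource optimization. Improving LLMs' ability to solve extremal problems could yield more efficient tools for complex problem solving and support the adoption of intelligent systems in practice.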

📄 Abstract (original)

Test-time scaling has enabled Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities remain insufficiently understood. Optimization reasoning, i.e. finding extrema under constraints, represents a fundamental abstraction that underpins critical applications in planning, control, resource allocation, and prompt search. To systematically evaluate this capability, we introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems, curated from inequality exercises used for Chinese Mathematical Olympiad and transformed into $93$ standardized extrema-finding problems. We conduct extensive evaluations across various state-of-the-art open-source model families, including the Qwen3, GPT-OSS, and DeepSeek. Our results reveal that LLMs' extremal-solving reasoning capabilities do not always align with those of current mathematical benchmarks such as AIME25 and MATH-500, with some models showing strong general mathematical reasoning but poor extremal-solving skills, and vice versa. This discrepancy highlights a critical gap in current evaluation practices and suggests that existing benchmarks may not comprehensively capture the full spectrum of mathematical reasoning abilities.
