Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

作者: Chang Yang, Ruiyu Wang, Junzhe Jiang, Qi Jiang, Qinggang Zhang, Yanchen Deng, Shuxin Li, Shuyue Hu, Bo Li, Florian T. Pokorny, Xiao Huang, Xinrun Wang

分类: cs.AI, cs.CL

发布日期: 2025-04-15

备注: Preliminary work, 10 pages for main text

💡 一句话要点

提出NPPC：一个可无限扩展的推理基准，用于评估大型语言模型在NP问题上的能力。

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 大型语言模型 推理基准 NP完全问题 可扩展性 自动评估

📋 核心要点

现有LLM基准存在易被破解和快速失效的问题，无法有效评估LLM的推理能力。
NPPC通过提供可无限扩展的NP完全问题实例，构建了一个难以破解且可自动验证的基准。
实验表明，NPPC能有效降低先进LLM的性能，并揭示了LLM在解决NP问题时的token使用和推理模式。

📝 摘要（中文）

本文提出了非确定性多项式时间问题挑战（NPPC），这是一个为大型语言模型（LLMs）设计的、可无限扩展的推理基准。NPPC旨在解决现有基准易被破解和快速失效的问题。它包含三个主要模块：npgym，提供25个NP完全问题的统一接口，并能生成任意数量和复杂度的实例；npsolver，提供统一接口，通过API和本地部署评估在线和离线模型；npeval，提供全面的工具，分析LLMs在不同问题上的表现，包括token数量、顿悟时刻、推理错误和解决方案错误。实验表明，NPPC能有效降低先进LLMs的性能至10%以下，表明其难以被破解。DeepSeek-R1、Claude-3.7-Sonnet和o1/o3-mini是最强大的LLMs，其中DeepSeek-R1在大多数NP完全问题上优于Claude-3.7-Sonnet和o1/o3-mini。当问题难度增加时，Claude-3.7-Sonnet和DeepSeek-R1等模型的token数量和顿悟时刻先增加后减少。NPPC是首个可无限扩展的推理基准，为LLMs迈向通用人工智能（AGI）提供了一个坚实的测试平台。

🔬 方法详解

问题定义：现有的大型语言模型（LLMs）推理基准存在两个主要问题：一是这些基准很容易在短时间内被攻破（不到一年），二是这些基准容易被“hack”。因此，需要一个能够持续扩展、难以攻破、难以破解、可自动验证且通用的基准来更有效地评估LLMs的推理能力。

核心思路：本文的核心思路是利用NP完全问题的特性，构建一个可以无限扩展的推理基准。NP完全问题具有实例生成简单但求解困难的特点，因此可以根据需要生成任意数量和复杂度的实例，从而保证基准的持续有效性。同时，由于NP完全问题的解具有可验证性，因此可以自动验证LLMs给出的答案是否正确。

技术框架：NPPC包含三个主要模块：npgym、npsolver和npeval。npgym提供25个NP完全问题的统一接口，可以生成任意数量和复杂度的实例。npsolver提供统一接口，通过API和本地部署评估在线和离线模型。npeval提供全面的工具，分析LLMs在不同问题上的表现，包括token数量、顿悟时刻、推理错误和解决方案错误。整体流程是：首先使用npgym生成NP完全问题实例，然后使用npsolver调用LLM进行求解，最后使用npeval分析LLM的性能。

关键创新：NPPC的关键创新在于提出了“可无限扩展性”的概念，并将其应用于推理基准的构建。通过利用NP完全问题的特性，NPPC可以根据需要生成任意数量和复杂度的实例，从而保证基准的持续有效性。此外，NPPC还提供了全面的评估工具，可以深入分析LLMs在解决NP问题时的推理过程和错误类型。

关键设计：npgym模块的关键设计在于提供了一个统一的接口，使得用户可以方便地生成各种NP完全问题的实例，并控制实例的复杂度。npsolver模块的关键设计在于支持在线和离线两种评估方式，使得用户可以根据需要选择合适的评估方式。npeval模块的关键设计在于提供了多种评估指标，包括token数量、顿悟时刻、推理错误和解决方案错误，从而可以全面分析LLMs的性能。

🖼️ 关键图片

📊 实验亮点

实验结果表明，NPPC能有效降低先进LLMs的性能至10%以下，证明了其难以被破解。DeepSeek-R1在大多数NP完全问题上优于Claude-3.7-Sonnet和o1/o3-mini，显示了其强大的推理能力。此外，研究发现，当问题难度增加时，Claude-3.7-Sonnet和DeepSeek-R1等模型的token数量和顿悟时刻先增加后减少，揭示了LLM在解决复杂问题时的推理模式。

🎯 应用场景

NPPC可用于评估和比较不同LLM的推理能力，指导LLM的训练和优化。此外，NPPC还可以用于研究LLM在解决复杂问题时的推理过程和错误类型，从而为开发更强大的LLM提供理论基础。该基准的推出，有助于推动通用人工智能的发展。

📄 摘要（原文）

Reasoning is the fundamental capability of large language models (LLMs). Due to the rapid progress of LLMs, there are two main issues of current benchmarks: i) these benchmarks can be crushed in a short time (less than 1 year), and ii) these benchmarks may be easily hacked. To handle these issues, we propose the ever-scalingness for building the benchmarks which are uncrushable, unhackable, auto-verifiable and general. This paper presents Nondeterministic Polynomial-time Problem Challenge (NPPC), an ever-scaling reasoning benchmark for LLMs. Specifically, the NPPC has three main modules: i) npgym, which provides a unified interface of 25 well-known NP-complete problems and can generate any number of instances with any levels of complexities, ii) npsolver: which provides a unified interface to evaluate the problem instances with both online and offline models via APIs and local deployments, respectively, and iii) npeval: which provides the comprehensive and ready-to-use tools to analyze the performances of LLMs over different problems, the number of tokens, the aha moments, the reasoning errors and the solution errors. Extensive experiments over widely-used LLMs demonstrate: i) NPPC can successfully decrease the performances of advanced LLMs' performances to below 10%, demonstrating that NPPC is uncrushable, ii) DeepSeek-R1, Claude-3.7-Sonnet, and o1/o3-mini are the most powerful LLMs, where DeepSeek-R1 outperforms Claude-3.7-Sonnet and o1/o3-mini in most NP-complete problems considered, and iii) the numbers of tokens, aha moments in the advanced LLMs, e.g., Claude-3.7-Sonnet and DeepSeek-R1, are observed first to increase and then decrease when the problem instances become more and more difficult. We believe that NPPC is the first ever-scaling reasoning benchmark, serving as the uncrushable and unhackable testbed for LLMs toward artificial general intelligence (AGI).

Nondeterministic Polynomial-time Problem Challenge: An Ever-Scaling Reasoning Benchmark for LLMs

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理