Primender Sequence: A Novel Mathematical Construct for Testing Symbolic Inference and AI Reasoning
Author: Mohd Anwar Jamal Faiz
Categories: cs.AI, cs.HC, cs.SC
Published: 2025-06-12
Comments: 9 pages, 7 figures, 2 tables, 3 code listings, OEIS sequence A384735
💡 One-sentence takeaway
Proposes the Primender sequence as a benchmark for evaluating the symbolic reasoning capabilities of large language models.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: symbolic reasoning, large language models, mathematical hypothesis validation, number theory, machine learning, artificial intelligence, reasoning evaluation
📋 Key points
- Existing evaluations lack interpretable, rule-based testbeds, making it difficult to assess LLM reasoning rigorously.
- The Primender sequence, a new mathematical construct combining primality with digit-suffix conditions, is proposed as a benchmark for symbolic reasoning.
- Experiments on multiple state-of-the-art LLMs assess rule inference, hypothesis validation, and large-scale sequence generation, demonstrating the sequence's value as an evaluation tool.
📝 Abstract (summary)
This paper introduces a novel integer sequence, the Primender sequence, defined by a hybrid rule combining classical primality with digit-based conditions: a number n is included if it is prime or ends in a prime suffix of any length. The sequence exhibits a deterministic yet non-trivial structure, blending number-theoretic properties with symbolic patterns. We propose it as a benchmark for evaluating the symbolic reasoning capabilities of large language models (LLMs), providing an interpretable, rule-based testbed for assessing an LLM's ability to infer hidden rules, validate mathematical hypotheses, and generalize symbolic logic at scale. A structured prompt and evaluation framework is designed to test multiple state-of-the-art LLMs.
🔬 Method details
Problem definition: The paper targets the shortage of interpretable benchmarks for evaluating large language models on symbolic reasoning and mathematical hypothesis validation.
Core idea: Propose the Primender sequence, the numbers that are prime or end in a prime suffix, as a probe of LLM reasoning, designed to reveal whether a model can recover the hidden rule and apply it consistently (see the sketch after this list).
Technical framework: A structured prompt and evaluation pipeline first fixes the sequence-generation rule, then runs reasoning tests across multiple LLMs, and finally scores inference accuracy and hypothesis-validation ability.
Key innovation: The sequence's definition couples number theory with symbolic logic, giving LLM evaluation a new angle distinct from random test sets or routine math problems.
Key design: Purpose-built prompts guide the models' reasoning, and performance is measured with metrics such as rule-inference accuracy, hypothesis evaluation, and sequence validity.
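To make the membership rule concrete, here is a minimal Python sketch. It assumes the rule reads as "n is included iff some decimal suffix of n is prime" (the full-length suffix is n itself, so primality of n is covered by the same loop); the function names are illustrative, not taken from the paper's code.

```python
def is_prime(k: int) -> bool:
    """Trial-division primality test; adequate for small terms."""
    if k < 2:
        return False
    if k < 4:
        return True
    if k % 2 == 0:
        return False
    d = 3
    while d * d <= k:
        if k % d == 0:
            return False
        d += 2
    return True

def is_primender(n: int) -> bool:
    """n qualifies if any decimal suffix of n (including n itself) is prime."""
    s = str(n)
    return any(is_prime(int(s[i:])) for i in range(len(s)))

# First terms under this reading: 2, 3, 5, 7, 11, 12, 13, 15, 17, 19, 22, ...
print([n for n in range(1, 30) if is_primender(n)])
```

Zero-led suffixes such as "03" need no special handling: stripping the leading zeros yields a shorter suffix that the loop tests anyway, so membership is unaffected.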
📊 Experimental highlights
Across the evaluated LLMs, the rule of the Primender sequence is inferred with high accuracy, especially on hypothesis validation, where some models show accuracy gains of more than 20%, supporting the sequence's effectiveness as an evaluation tool.
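As a hedged illustration of how a sequence-validity score might be computed (the paper's exact metric definitions are not reproduced here), model-generated terms can be graded against the ground-truth rule, reusing is_primender from the sketch above; sequence_validity is a hypothetical name, not the paper's:

```python
def sequence_validity(generated: list[int], limit: int) -> float:
    """Hypothetical validity metric: the fraction of generated terms
    at or below `limit` that actually satisfy the Primender rule."""
    checked = [n for n in generated if n <= limit]
    if not checked:
        return 0.0
    return sum(is_primender(n) for n in checked) / len(checked)
```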
🎯 Application scenarios
Potential application areas include education, AI reasoning systems, and mathematical verification tools. By providing an interpretable reasoning benchmark, the work can help build AI systems that perform better on complex reasoning tasks, with possible longer-term impact on AI in scientific research and engineering.
📄 Abstract (original)
This paper introduces the Primender sequence, a novel integer sequence defined by a hybrid rule that combines classical primality with modular digit-based conditions. Specifically, a number n is included in the sequence if it is prime or ends with a prime number of unit digit or any length. In other words, numbers which are primes or have at least one prime suffix. The resulting sequence exhibits a deterministic yet non-trivial structure, blending number-theoretic properties with symbolic patterning. We propose the Primender sequence as a benchmark for evaluating the symbolic reasoning capabilities of Large Language Models (LLMs). The study is motivated by the need for interpretable, rule-based testbeds that can assess an LLM's ability to infer hidden rules, validate mathematical hypotheses, and generalize symbolic logic at scale. A key hypothesis explored is: Whenever a number in the Primender sequence is exactly one more than the largest prime less than or equal to it, the difference between it and the previous number in the sequence is also 1. We design a structured prompt and evaluation framework to test this hypothesis across multiple state-of-the-art LLMs, including ChatGPT, Copilot, DeepSeek, Gemini, Grok, and LLaMA. The models are tasked with identifying the underlying rule, validating the hypothesis, and generating the next 100,000 terms of the sequence. Comparative metrics such as rule inference accuracy, hypothesis evaluation, sequence validity, and symbolic explanation quality are used to assess model performance. This work contributes a novel mathematical construct and a reproducible methodology for benchmarking LLMs in symbolic reasoning, hypothesis testing, and scalable pattern generalization - bridging the domains of number theory, artificial intelligence, and software engineering.
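The abstract's key hypothesis can also be checked mechanically. Below is a minimal sketch reusing is_prime and is_primender from the earlier listing; the scan bound 10**4 is an arbitrary choice for illustration, not the paper's:

```python
def largest_prime_leq(n: int) -> int:
    """Largest prime p with p <= n (assumes n >= 2)."""
    while not is_prime(n):
        n -= 1
    return n

seq = [n for n in range(2, 10**4) if is_primender(n)]
for prev, cur in zip(seq, seq[1:]):
    if cur == largest_prime_leq(cur) + 1:
        # Hypothesis: such a term follows its predecessor by exactly 1.
        assert cur - prev == 1, f"counterexample at {cur}"
print("Hypothesis holds for all terms below 10**4.")
```

The check passes on the stated range, which is consistent with the hypothesis: if cur - 1 is prime, it belongs to the sequence itself, so it is necessarily the immediately preceding term.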