MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models

Author: Hafsteinn Einarsson

Category: cs.AI

Published: 2025-07-27

💡 One-Sentence Summary

Introduces the MazeEval benchmark for evaluating the spatial reasoning of language models.

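🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: spatial reasoning, large language models, maze navigation, benchmark, autonomous systems, language model evaluation, cross-lingual ability, function-calling interface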

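📋 Key Points

Current research lacks an evaluation of how large language models (LLMs) navigate space without visual cues, which limits their use in autonomous systems.
The paper proposes the MazeEval benchmark, which uses coordinate-based maze navigation tasks to evaluate the spatial reasoning of LLMs with visual input excluded.
In the experiments, OpenAI's O3 navigates mazes up to $30\times 30$ perfectly, while the other models fail catastrophically beyond $9\times 9$; in Icelandic, models solve mazes 3-4 sizes smaller than in English.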

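📝 Abstract (Translated Summary)

As large language models (LLMs) are increasingly used in robotics and embodied AI, understanding their spatial reasoning capabilities is essential for reliable real-world deployment. Despite progress in language understanding, current research lacks an evaluation of how LLMs navigate space without visual cues. This paper introduces the MazeEval benchmark to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. The method uses a function-calling interface: models navigate mazes of varying complexity ($5\times 5$ to $15\times 15$ grids) using only coordinate feedback and distance-to-wall information, with visual input excluded in order to test basic spatial cognition. Eight state-of-the-art LLMs are evaluated on identical mazes in English and Icelandic, and the results differ sharply: OpenAI's O3 navigates perfectly in mazes up to $30\times 30$, while the other models fail beyond $9\times 9$, with 100% of failures attributed to excessive looping, defined as revisiting a cell at least 10 times. Performance drops significantly in Icelandic, where models solve mazes 3-4 sizes smaller than in English, which suggests that spatial reasoning in LLMs emerges from linguistic patterns rather than language-agnostic mechanisms. These results matter for the global deployment of LLM-powered autonomous systems: they indicate that spatial intelligence remains constrained by training data availability, and they highlight the need for architectural innovations to achieve reliable navigation across linguistic contexts.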

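🔬 Method Details

Problem definition: The paper addresses the evaluation of spatial reasoning in large language models when no visual cues are available. Existing evaluations do not test how LLMs navigate with limited sensory information, so their reliability in such settings is unknown.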

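Core idea: MazeEval poses coordinate-based maze navigation tasks that target spatial reasoning alone. Visual input is excluded so that the test isolates spatial reasoning from visual perception.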

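Technical framework: The setup is built around a function-calling interface through which the model receives coordinate feedback and distance-to-wall information while it navigates the maze. Maze sizes range from $5\times 5$ to $15\times 15$ grids, so the evaluation covers several levels of complexity.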
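To make the interface concrete, here is a minimal sketch of what such a function-calling environment might look like. Everything in it is an assumption for illustration and not the paper's implementation: the `MazeEnv` class, the `move` function, the direction names, the response fields, and the text grid in which `#` marks a blocked cell. The paper's mazes may well represent walls differently (for example, between cells).

```python
from dataclasses import dataclass

# Hypothetical direction names; y grows downward (row index in the grid).
DIRECTIONS = {"north": (0, -1), "south": (0, 1), "east": (1, 0), "west": (-1, 0)}


@dataclass
class MazeEnv:
    """Toy maze: '#' is a blocked cell, anything else is open (an assumption)."""
    grid: list[str]
    x: int
    y: int

    def _is_open(self, x: int, y: int) -> bool:
        # A cell is open if it lies inside the grid and is not blocked.
        return (
            0 <= y < len(self.grid)
            and 0 <= x < len(self.grid[0])
            and self.grid[y][x] != "#"
        )

    def distance_to_wall(self, direction: str) -> int:
        # Number of open cells that can be walked in this direction.
        dx, dy = DIRECTIONS[direction]
        steps = 0
        x, y = self.x + dx, self.y + dy
        while self._is_open(x, y):
            steps += 1
            x, y = x + dx, y + dy
        return steps

    def move(self, direction: str) -> dict:
        # The "tool" a model would call: it returns only coordinates and
        # distance-to-wall values, never a picture of the maze.
        dx, dy = DIRECTIONS[direction]
        nx, ny = self.x + dx, self.y + dy
        moved = self._is_open(nx, ny)
        if moved:
            self.x, self.y = nx, ny
        return {
            "moved": moved,
            "position": (self.x, self.y),
            "distance_to_wall": {d: self.distance_to_wall(d) for d in DIRECTIONS},
        }
```

In a setup like this, the model would see only the dictionary returned by each `move` call, so any map of the maze has to be built up from the sequence of coordinates and distances.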

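Key innovation: The main contribution is the MazeEval benchmark itself, which focuses on spatial reasoning and fills a gap in existing evaluations. By removing visual input, it aims to measure the spatial navigation ability of LLMs separately from visual perception.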

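Key design: All models are evaluated on identical mazes in English and Icelandic, which keeps the comparison consistent across languages. Performance in Icelandic is markedly lower than in English, which points to the influence of linguistic patterns on spatial reasoning.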


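📊 Experimental Highlights

OpenAI's O3 achieves perfect navigation in mazes up to $30\times 30$, while the other models perform poorly beyond $9\times 9$. 100% of the failures are attributed to excessive looping behavior. Performance in Icelandic is also significantly lower than in English: models solve mazes 3-4 sizes smaller, which underlines the influence of language on spatial reasoning.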
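The original abstract defines excessive looping as a model revisiting a cell at least 10 times. A check of that kind could be sketched as follows. The function name `first_loop_step`, the path representation as a list of `(x, y)` tuples, and the choice to count only returns after the first visit are assumptions made for illustration; the paper may count visits differently.

```python
from collections import Counter

# Threshold taken from the abstract: a cell revisited at least 10 times.
LOOP_THRESHOLD = 10


def first_loop_step(path, threshold=LOOP_THRESHOLD):
    """Return the index in `path` at which some cell reaches `threshold`
    revisits, or None if that never happens.

    A "revisit" here means any occurrence of a cell after its first one
    (an assumption about how the criterion is counted).
    """
    visits = Counter()
    for step, cell in enumerate(path):
        visits[cell] += 1
        if visits[cell] - 1 >= threshold:
            return step
    return None
```

For example, a model that oscillates between two neighboring cells would trip this check at the eleventh occurrence of the first cell, whereas a path that never returns to a cell yields `None`.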

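🎯 Application Scenarios

Potential application areas include autonomous robot navigation, intelligent assistants, and other AI systems that depend on spatial reasoning. Improving the spatial reasoning of LLMs could strengthen their decision-making in complex environments and support the practical deployment of such systems.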

📄 Abstract (Original)

As Large Language Models (LLMs) increasingly power autonomous agents in robotics and embodied AI, understanding their spatial reasoning capabilities becomes crucial for ensuring reliable real-world deployment. Despite advances in language understanding, current research lacks evaluation of how LLMs perform spatial navigation without visual cues, a fundamental requirement for agents operating with limited sensory information. This paper addresses this gap by introducing MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in LLMs through coordinate-based maze navigation tasks. Our methodology employs a function-calling interface where models navigate mazes of varying complexity ($5\times 5$ to $15\times 15$ grids) using only coordinate feedback and distance-to-wall information, excluding visual input to test fundamental spatial cognition. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities. Our findings reveal striking disparities: while OpenAI's O3 achieves perfect navigation for mazes up to size $30\times 30$, other models exhibit catastrophic failure beyond $9\times 9$ mazes, with 100% of failures attributed to excessive looping behavior where models revisit a cell at least 10 times. We document a significant performance degradation in Icelandic, with models solving mazes 3-4 sizes smaller than in English, suggesting spatial reasoning in LLMs emerges from linguistic patterns rather than language-agnostic mechanisms. These results have important implications for global deployment of LLM-powered autonomous systems, showing spatial intelligence remains fundamentally constrained by training data availability and highlighting the need for architectural innovations to achieve reliable navigation across linguistic contexts.
