CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks

Authors: Jie Feng, Jun Zhang, Tianhui Liu, Xin Zhang, Tianjian Ouyang, Junbo Yan, Yuwei Du, Siqi Guo, Yong Li

Categories: cs.AI, cs.CL, cs.LG

Published: 2024-06-20 (updated: 2025-05-31)

Comments: Accepted by KDD 2025 D&B Track, https://github.com/tsinghua-fib-lab/CityBench

💡 One-sentence summary

CityBench: a benchmark for urban tasks that systematically evaluates the capabilities of large language models in urban research

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: urban task evaluation, large language models, vision-language models, urban simulation, smart cities

📋 Key points

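Existing work lacks a systematic and scalable benchmark for urban tasks, which makes it hard to evaluate the capabilities of LLMs and VLMs in urban environments.
The paper builds CityBench, an evaluation platform based on an interactive simulator. It comprises CityData, which integrates urban data, and CitySimu, which simulates urban dynamics.
Experiments show that LLMs and VLMs perform well on urban tasks requiring commonsense and semantic understanding, but fall short on tasks requiring professional knowledge and numerical ability.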

📝 Abstract (translated from Chinese)

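As large language models (LLMs) continue to develop and see wide adoption, systematic and reliable evaluation methods are essential for ensuring that LLMs and vision-language models (VLMs) are effective and dependable in the real world. Early work has explored the usability of LLMs on a limited set of urban tasks, but a systematic and scalable evaluation benchmark is still missing. Building such a benchmark for urban research is difficult because urban data is diverse, application scenarios are complex, and the urban environment is highly dynamic. This paper designs CityBench, an evaluation platform based on an interactive simulator, as the first systematic benchmark for evaluating the capabilities of LLMs across a variety of tasks in urban research. The authors first build CityData to integrate diverse urban data and CitySimu to simulate fine-grained urban dynamics. On top of CityData and CitySimu, they design 8 representative urban tasks in two categories, perception-understanding and decision-making, which together form CityBench. Based on extensive results from 30 well-known LLMs and VLMs in 13 cities around the world, they find that advanced LLMs and VLMs achieve competitive performance on urban tasks that require commonsense and semantic understanding, such as understanding human dynamics and semantic inference over urban images. At the same time, the models fail on challenging urban tasks that require professional knowledge and high-level numerical ability, such as geospatial prediction and traffic control.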

🔬 Method details

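Problem definition: Large language models (LLMs) and vision-language models (VLMs) have considerable potential for urban tasks, but there is no systematic benchmark for measuring how they perform in complex urban environments. Existing evaluation methods struggle to cover the diversity of urban data, the complexity of application scenarios, and the dynamic nature of urban environments, which hinders the effective use of LLMs and VLMs in urban research.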

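Core idea: The paper builds an interactive urban simulation environment, CityBench, that simulates realistic urban data and dynamics and provides a set of representative urban tasks, so that the capabilities of LLMs and VLMs in urban environments can be evaluated comprehensively. This gives a more accurate picture of their strengths and limitations on urban tasks.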

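Technical framework: CityBench consists of three main modules: CityData, CitySimu, and the CityBench task set. CityData integrates diverse urban data, including geographic information, traffic data, and demographic data. CitySimu simulates fine-grained urban dynamics such as traffic flow and crowd movement. The task set contains 8 representative urban tasks in two categories, perception-understanding and decision-making. In the overall workflow, an LLM or VLM receives urban environment information from CityData and CitySimu, reasons and makes decisions as each task requires, and is then scored on its performance.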
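
The workflow described above (the model receives environment context, answers, and is scored per task) can be sketched as a small evaluation loop. This is a minimal illustration under stated assumptions, not the CityBench implementation: `Example`, `Task`, `exact_match`, and `evaluate` are hypothetical names, the model is assumed to be any callable from prompt string to answer string, and exact-match scoring stands in for whatever task-specific metrics the benchmark actually uses.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical types for illustration; they do not mirror the CityBench codebase.
Model = Callable[[str], str]


@dataclass
class Example:
    prompt: str  # environment context plus the task question
    answer: str  # reference answer


@dataclass
class Task:
    name: str
    category: str  # "perception-understanding" or "decision-making"
    examples: List[Example]


def exact_match(prediction: str, answer: str) -> float:
    """Score 1.0 when prediction and answer agree, ignoring case and outer whitespace."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


def evaluate(model: Model, tasks: List[Task]) -> Dict[str, float]:
    """Return the mean task score for each task category."""
    per_category: Dict[str, List[float]] = {}
    for task in tasks:
        # Query the model once per example and average the scores for this task.
        scores = [exact_match(model(ex.prompt), ex.answer) for ex in task.examples]
        per_category.setdefault(task.category, []).append(mean(scores))
    # Average over tasks so each task counts equally within its category.
    return {category: mean(values) for category, values in per_category.items()}
```

Averaging per task before averaging per category keeps a task with many examples from dominating its category; this is a design choice of the sketch, not something the source specifies.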

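Key innovation: CityBench's main contribution is that it is systematic and scalable. It provides both a comprehensive dataset covering many kinds of urban data and a simulation environment that models urban dynamics. Its task set also covers several important tasks in urban research, allowing a comprehensive evaluation of LLMs and VLMs in urban environments. Compared with existing methods, CityBench is more representative and more practical.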

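Key design: CityData is designed around the diversity and heterogeneity of urban data and uses a unified data format and storage scheme. CitySimu uses agent-based modeling, which can simulate the interaction between individual behavior and group behavior. The task set is designed with task difficulty and representativeness in mind, covers both perception-understanding and decision-making tasks, and defines explicit evaluation metrics.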
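
To make the agent-based modeling idea concrete, the toy below shows how a simple individual rule produces group-level behavior such as queueing. It is only a sketch under assumptions: the circular single-lane road, the `Agent` class, and the `step` function are hypothetical and are not taken from CitySimu.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Agent:
    position: int  # cell index on a circular single-lane road


def step(agents: List[Agent], road_length: int) -> None:
    """Advance every agent by one cell unless the cell ahead is occupied."""
    occupied = {agent.position for agent in agents}
    # Decide all moves from the current snapshot so the update is synchronous.
    can_move = [(agent.position + 1) % road_length not in occupied for agent in agents]
    for agent, move in zip(agents, can_move):
        if move:
            agent.position = (agent.position + 1) % road_length
```

Each agent only checks the cell directly ahead, yet a blocked agent holds up the ones behind it, which is the kind of individual-to-group interaction that agent-based simulators aim to capture.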

📊 Experimental highlights

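Experiments show that advanced LLMs and VLMs are competitive on urban tasks that require commonsense and semantic understanding, such as understanding human dynamics and semantic inference over urban images. On urban tasks that require professional knowledge and high-level numerical ability, such as geospatial prediction and traffic control, their performance remains inadequate. The study thus characterizes both the strengths and the limitations of LLMs and VLMs in urban research.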

🎯 Application scenarios

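The results are relevant to smart city development, urban planning, traffic management, and public safety. Evaluating LLMs and VLMs on urban tasks with CityBench can give city managers decision support, help optimize the allocation of urban resources, improve operational efficiency, and support sustainable urban development.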

📄 Abstract (original)

As large language models (LLMs) continue to advance and gain widespread use, establishing systematic and reliable evaluation methodologies for LLMs and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability. There have been some early explorations about the usability of LLMs for limited urban tasks, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data, the complexity of application scenarios and the highly dynamic nature of the urban environment. In this paper, we design CityBench, an interactive simulator based evaluation platform, as the first systematic benchmark for evaluating the capabilities of LLMs for diverse tasks in urban research. First, we build CityData to integrate the diverse urban data and CitySimu to simulate fine-grained urban dynamics. Based on CityData and CitySimu, we design 8 representative urban tasks in 2 categories of perception-understanding and decision-making as the CityBench. With extensive results from 30 well-known LLMs and VLMs in 13 cities around the world, we find that advanced LLMs and VLMs can achieve competitive performance in diverse urban tasks requiring commonsense and semantic understanding abilities, e.g., understanding the human dynamics and semantic inference of urban images. Meanwhile, they fail to solve the challenging urban tasks requiring professional knowledge and high-level numerical abilities, e.g., geospatial prediction and traffic control task.