Diffusion Transformers as Open-World Spatiotemporal Foundation Models

Authors: Yuan Yuan, Chonghua Han, Jingtao Ding, Guozhen Zhang, Depeng Jin, Yong Li

Categories: cs.LG, cs.AI

Published: 2024-11-19 (updated: 2025-10-20)

Note: Accepted by NeurIPS 2025

🔗 Code/Project: https://github.com/tsinghua-fib-lab/UrbanDiT

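💡 One-Sentence Summary

UrbanDiT: an open-world spatio-temporal foundation model built on diffusion Transformers for modeling urban environments.

🎯 Matched Areas: Pillar 8: Physics-based Animation; Pillar 9: Embodied Foundation Models

Keywords: spatio-temporal prediction, diffusion models, Transformer, prompt learning, urban computing

📋 Key Points

Existing urban spatio-temporal modeling methods struggle to integrate heterogeneous data and generalize poorly, so they cannot adapt to open-world scenarios.
UrbanDiT uses a diffusion Transformer and a prompt learning framework to handle multiple data types in one model and to adaptively generate task-specific prompts.
Experiments show that UrbanDiT performs well across a range of urban spatio-temporal tasks, and its zero-shot performance surpasses nearly all baselines that have access to training data.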

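📝 Abstract (Translated)

This paper proposes UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion Transformers in this field. UrbanDiT is the first to propose a unified model that integrates diverse data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This lets the model unify multi-data and multi-task learning and effectively support a wide range of spatio-temporal applications. Its key innovation is a carefully designed prompt learning framework that adaptively generates data-driven and task-specific prompts, guiding the model to perform well across urban applications. UrbanDiT has three advantages: 1) it unifies diverse data types, such as grid-based and graph-based data, into a sequential format; 2) with task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; 3) it generalizes effectively to open-world scenarios, with zero-shot capabilities that outperform nearly all baselines with training data. UrbanDiT sets a new benchmark for foundation models in the urban spatio-temporal domain. Code and datasets are publicly available at https://github.com/tsinghua-fib-lab/UrbanDiT.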

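🔬 Method Details

Problem definition: Urban environments have complex spatio-temporal dynamics that existing methods struggle to model and predict. The specific pain points are: data from different sources and of different types (such as grid data and graph data) cannot be integrated effectively; generalization is weak, so models adapt poorly to new cities and scenarios; and it is hard to support several spatio-temporal tasks (such as prediction, interpolation, and extrapolation) at the same time.

Core idea: UrbanDiT uses the modeling capacity of diffusion Transformers to learn a general representation of urban spatio-temporal data. With a unified data format and a prompt learning framework, the model can adapt to different data types and tasks, which unifies multi-data and multi-task learning. The design aims to improve generalization so that the model can be applied to open-world urban spatio-temporal scenarios.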

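Technical framework: The overall UrbanDiT architecture consists of the following main modules: 1) Data encoding module: encodes different data types (such as grid data and graph data) into a unified sequence format. 2) Diffusion Transformer module: uses a diffusion Transformer to learn a general representation of spatio-temporal data. 3) Prompt learning module: adaptively generates task-specific prompts for each task to guide the model's predictions. 4) Decoding module: decodes the model output into the format required by the specific task.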
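As a rough illustration of what encoding grid and graph data into one sequence format could look like, the NumPy sketch below turns a grid into patch tokens and a graph signal into node tokens. The function names, the patch size, and the array shapes are assumptions made for this example and are not taken from the UrbanDiT code; the sketch also ignores graph topology entirely.

```python
import numpy as np


def grid_to_sequence(grid: np.ndarray, patch: int) -> np.ndarray:
    """Split a (T, H, W) grid into non-overlapping patch x patch tiles.

    Returns shape (T, N, patch * patch), where
    N = (H // patch) * (W // patch) spatial tokens per time step.
    (Hypothetical helper, not the paper's encoder.)
    """
    t, h, w = grid.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by the patch size")
    tiles = grid.reshape(t, h // patch, patch, w // patch, patch)
    tiles = tiles.transpose(0, 1, 3, 2, 4)  # (T, H/p, W/p, p, p)
    return tiles.reshape(t, -1, patch * patch)


def graph_to_sequence(signal: np.ndarray) -> np.ndarray:
    """Treat each node of a (T, N) graph signal as one token with one feature."""
    return signal[:, :, None]


# Toy inputs: a 2-step 4x4 grid and a 2-step signal on 5 graph nodes.
grid = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
graph = np.zeros((2, 5))

grid_tokens = grid_to_sequence(grid, patch=2)   # shape (2, 4, 4)
graph_tokens = graph_to_sequence(graph)         # shape (2, 5, 1)
```

Both outputs share the layout (time, tokens, features), so a single sequence model can consume either one after a per-type linear projection to a common width.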

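Key innovation: UrbanDiT's most important technical innovation is its prompt learning framework, which adaptively generates data-driven and task-specific prompts that guide the model through the various spatio-temporal tasks. Compared with existing methods, this framework is more flexible and efficient, makes better use of the information in the data, and adapts to different task requirements.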
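One way to picture the different tasks is as different observed/hidden patterns over the same space-time sequence. The sketch below is an assumption for illustration only: the task names, the split of bi-directional prediction into forward and backward cases, and the half-and-half mask ratios are hypothetical and do not describe how UrbanDiT builds its prompts.

```python
import numpy as np


def task_mask(task: str, t: int, n: int, rng=None) -> np.ndarray:
    """Return a boolean (T, N) mask: True = observed, False = to be generated.

    Hypothetical helper; mask ratios are arbitrary choices for the example.
    """
    mask = np.ones((t, n), dtype=bool)
    if task == "forward_prediction":        # hide the later half of the time axis
        mask[t // 2:, :] = False
    elif task == "backward_prediction":     # hide the earlier half of the time axis
        mask[: t // 2, :] = False
    elif task == "temporal_interpolation":  # hide every other time step
        mask[1::2, :] = False
    elif task == "spatial_extrapolation":   # hide the second half of the locations
        mask[:, n // 2:] = False
    elif task == "imputation":              # hide random space-time points
        if rng is None:
            rng = np.random.default_rng(0)
        mask = rng.random((t, n)) >= 0.5
    else:
        raise ValueError(f"unknown task: {task}")
    return mask


forward = task_mask("forward_prediction", t=4, n=3)
interp = task_mask("temporal_interpolation", t=4, n=3)
```

Under this view, a task-specific prompt only has to tell the model which pattern it is facing; the backbone and the sequence format stay the same.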

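Key design: UrbanDiT's key design choices are: 1) a unified data encoding that converts different data types into a sequence format the model can process; 2) a carefully designed diffusion Transformer structure that learns the dependencies in spatio-temporal data; 3) an adaptive prompt learning framework that generates suitable prompts for each task. The specific parameter settings, loss function, and network structure are described in detail in the paper but are not available here.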
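Because the summary does not give the loss or noise schedule, the sketch below shows only the generic forward (noising) step of a standard DDPM-style diffusion model, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The linear beta schedule, the number of steps, and all names are assumptions; UrbanDiT's actual formulation may differ.

```python
import numpy as np


def linear_alpha_bar(num_steps: int, beta_start: float = 1e-4,
                     beta_end: float = 0.02) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0: np.ndarray, step: int, alpha_bar: np.ndarray,
              rng: np.random.Generator):
    """Sample x_t from q(x_t | x_0) and return it together with the noise used."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[step]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise, noise


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(1000)
x0 = np.ones((4, 3))  # toy (T, N) spatio-temporal sequence
xt, eps = add_noise(x0, step=999, alpha_bar=alpha_bar, rng=rng)
```

In a typical setup, a denoising network is trained to recover `eps` (or `x0`) from `xt` and the step index; for the tasks described here, the observed part of the sequence would act as conditioning.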

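📊 Experimental Highlights

UrbanDiT achieves notable performance gains on several urban spatio-temporal tasks. In traffic flow prediction, for example, it is more accurate than existing baselines. More importantly, its zero-shot capability is strong: without training on data from a specific city, it still matches or exceeds baselines that were trained on that data. This indicates that UrbanDiT generalizes well to new cities and scenarios.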

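🎯 Application Scenarios

UrbanDiT can be applied to several areas of smart-city development, such as traffic flow prediction, air quality monitoring, and crowd flow analysis. It can help city managers understand how a city operates, optimize resource allocation, improve city services, and ground urban planning in evidence. In the future, UrbanDiT could become a core component of a "city brain" and advance intelligent urban development.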

📄 Abstract (Original)

The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in the elaborated prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications. UrbanDiT offers three advantages: 1) It unifies diverse data types, such as grid-based and graph-based data, into a sequential format; 2) With task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) It generalizes effectively to open-world scenarios, with its powerful zero-shot capabilities outperforming nearly all baselines with training data. UrbanDiT sets up a new benchmark for foundation models in the urban spatio-temporal domain. Code and datasets are publicly available at https://github.com/tsinghua-fib-lab/UrbanDiT.
