TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations
Authors: Maria Despoina Siampou, Gengchen Mai, Ni Lao, Jinmeng Rao, Neha Arora, Cyrus Shahabi, Shushman Choudhury
Categories: cs.CV, cs.LG
Published: 2026-05-07
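💡 One-Sentence Takeaway
Proposes TrajGANR, a framework that enables trajectory-centric urban multimodal learning through geospatially aligned neural representations.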
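🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: multimodal self-supervised learning, geospatial foundation models, trajectory representation learning, urban computing, neural representations, cross-modal alignment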
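📋 Key Points
- Existing geospatial multimodal learning relies mostly on aligning static observations at the same or nearby locations, so it cannot handle human mobility trajectories, which are continuous in space and time.
- TrajGANR learns a continuous neural representation of each trajectory, enabling fine-grained cross-modal alignment between the trajectory path, street-view images that do not coincide with any waypoint, and geographic locations.
- Experiments on four urban mobility and road understanding tasks show that the framework outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model.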
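📝 Abstract (Condensed)
Multimodal self-supervised learning (MSSL) has become a key paradigm for pretraining geospatial foundation models. However, existing methods mainly target static modalities such as satellite imagery, street-view imagery, and text, and drive learning by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along a path rather than observations at discrete locations. Although trajectories are essential for understanding urban activity, they remain underexplored in existing geospatial MSSL frameworks. This paper presents TrajGANR, a trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of a trajectory at arbitrary points along its path, which enables fine-grained alignment with nearby street-view images even when those images are not co-located with any trajectory waypoint. The authors use this capability to introduce a multimodal learning objective that jointly aligns trajectories, street-view images, and their geographic locations. Experiments on four urban mobility and road understanding tasks show that TrajGANR consistently outperforms existing frameworks, confirming the importance of fine-grained geospatial alignment.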
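🔬 Method Details
Problem definition: Existing geospatial multimodal self-supervised learning (MSSL) mainly handles static observations (such as satellite imagery and street-view imagery) and cannot effectively model human trajectories, which are continuous, dynamic movement patterns. As a result, trajectory data is marginalized in urban understanding tasks.
Core idea: Introduce a "trajectory-centric" learning paradigm. Neural representation learning maps the discrete points of a trajectory to a continuous function of the path, so that fine-grained semantic alignment with static geographic modalities (such as street-view imagery) can be established at any spatial position along it.
Technical framework: The TrajGANR framework comprises a trajectory encoder, a geospatial feature extractor, and a multimodal alignment module. The system first converts a trajectory into a continuous neural representation, then uses a geospatial alignment mechanism to map and fuse trajectory features with street-view features at the corresponding geographic coordinates.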
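Key innovation: The central novelty is the use of a continuous neural representation. It removes the restriction that alignment must happen at specific waypoints, so any point along the trajectory path can be associated with information from the surrounding environment, which improves the model's awareness of urban spatial structure.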
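To make the idea of querying a trajectory "between waypoints" concrete, here is a minimal geometric sketch. It is not the paper's learned representation: it only shows how a street-view image location that does not coincide with any waypoint can be mapped to a continuous position `t` along the path, which a learned continuous representation could then be evaluated at. The function name `project_onto_path`, the use of planar (already projected) coordinates, and the arc-length parameterization are all assumptions made for illustration.

```python
import math

def project_onto_path(waypoints, query):
    """Hypothetical helper: find the closest point on a polyline to `query`.

    Returns (t, point), where t in [0, 1] is the normalized arc-length
    position along the path and point is the closest path point.
    Coordinates are assumed to be planar (x, y), not raw latitude/longitude.
    """
    segments = list(zip(waypoints, waypoints[1:]))
    seg_lengths = [math.dist(a, b) for a, b in segments]
    total = sum(seg_lengths)

    best = (float("inf"), 0.0, waypoints[0])  # (distance, t, point)
    travelled = 0.0
    for (a, b), seg_len in zip(segments, seg_lengths):
        if seg_len > 0:
            # Fraction along a->b of the orthogonal projection, clamped to the segment.
            u = ((query[0] - a[0]) * (b[0] - a[0])
                 + (query[1] - a[1]) * (b[1] - a[1])) / seg_len ** 2
            u = min(1.0, max(0.0, u))
        else:
            u = 0.0  # degenerate segment (repeated waypoint)
        p = (a[0] + u * (b[0] - a[0]), a[1] + u * (b[1] - a[1]))
        d = math.dist(p, query)
        if d < best[0]:
            t = (travelled + u * seg_len) / total if total > 0 else 0.0
            best = (d, t, p)
        travelled += seg_len
    return best[1], best[2]

# An L-shaped trajectory with three waypoints and an image taken off the path.
path = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
t, point = project_onto_path(path, (12.0, 5.0))
print(t, point)
```

In this example the image location falls halfway along the second segment, between two waypoints, which is exactly the case that waypoint-only alignment cannot represent.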
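Key design: A joint multimodal loss uses a contrastive learning strategy to optimize the embedding spaces of trajectories, street-view images, and geographic locations simultaneously, so that the model keeps a consistent geographic semantics across the multimodal feature space.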
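The summary does not give the exact form of the joint loss, so the following is only a sketch of one common way to align three modalities contrastively: a symmetric InfoNCE term for each pair (trajectory and image, trajectory and location, image and location), summed with equal weight. The function names, the equal weighting, and the temperature value are assumptions, not details taken from the paper.

```python
import math

def normalize(v):
    """Scale a nonzero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(anchors, targets, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to targets[i] among all targets."""
    anchors = [normalize(a) for a in anchors]
    targets = [normalize(t) for t in targets]
    loss = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every target, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in targets]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(anchors)

def symmetric_info_nce(a, b, temperature=0.1):
    """Average of the a->b and b->a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def joint_alignment_loss(traj, image, location, temperature=0.1):
    """Assumed joint objective: sum of pairwise symmetric InfoNCE terms."""
    return (symmetric_info_nce(traj, image, temperature)
            + symmetric_info_nce(traj, location, temperature)
            + symmetric_info_nce(image, location, temperature))

# Toy batch of two samples with 2-D embeddings per modality.
traj = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.9, 0.1], [0.2, 0.8]]
location = [[1.0, 0.1], [0.1, 1.0]]
print(joint_alignment_loss(traj, image, location))
```

Row `i` of each list is assumed to describe the same trajectory point, its nearby street-view image, and its location; the loss falls as matching rows become more similar than mismatched ones.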
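📊 Experimental Highlights
TrajGANR achieves state-of-the-art results on all four urban mobility and road understanding tasks. Comparative experiments show that, when handling sparse trajectory data, its fine-grained alignment mechanism makes feature extraction markedly more robust, and it outperforms both existing general-purpose geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further confirm the effectiveness of the joint multimodal learning objective.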
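🎯 Application Scenarios
This work is broadly applicable to smart-city development, including urban traffic flow prediction, road network topology inference, trajectory-based identification of urban functional zones, and personalized travel recommendation systems. By fusing trajectory and visual information, the model can give urban planners more precise analysis of fine-scale behavior and better insight into large-scale spatial change.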
📄 Abstract (Original)
Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.