PatchTraj: Unified Time-Frequency Representation Learning via Dynamic Patches for Trajectory Prediction

Authors: Yanghong Liu, Xingping Dong, Ming Li, Weixing Zhang, Yidong Lou

Categories: cs.CV, cs.AI

Published: 2025-07-25 (updated: 2025-07-31)

💡 One-Sentence Takeaway

PatchTraj: trajectory prediction via dynamic patches and joint time-frequency representation learning

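🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 6: Video Extraction & Matching; Pillar 8: Physics-based Animation

Keywords: trajectory prediction, time-frequency analysis, dynamic patches, multi-scale learning, cross-modal fusion, Transformer, embodied intelligence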

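📋 Key Points

Existing point-based or grid-based trajectory prediction methods struggle to balance local motion details with long-range spatiotemporal dependencies, so they model human motion dynamics insufficiently.
PatchTraj performs multi-scale segmentation through dynamic patch partitioning and combines it with joint time-frequency modeling, capturing hierarchical motion patterns and improving prediction accuracy.
Experiments on multiple datasets show that PatchTraj achieves state-of-the-art (SOTA) performance; on the JRDB dataset in particular, it improves ADE by 26.7% and FDE by 17.4% in relative terms.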

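📝 Abstract (Translated Summary)

This paper proposes PatchTraj, a dynamic patch-based framework for trajectory prediction that integrates joint time-frequency modeling. The method decomposes a trajectory into raw time sequences and frequency components, and applies dynamic patch partitioning for multi-scale segmentation to capture hierarchical motion patterns. Each patch is adaptively embedded with scale-aware feature extraction, followed by hierarchical feature aggregation to model both fine-grained and long-range dependencies. The outputs of the two branches are enhanced through cross-modal attention, which promotes complementary fusion of temporal and spectral cues. The resulting enhanced embeddings are highly expressive and enable accurate predictions even with a vanilla Transformer architecture. Extensive experiments on the ETH-UCY, SDD, NBA, and JRDB datasets show that the method achieves state-of-the-art performance. Notably, on the egocentric JRDB dataset, PatchTraj attains relative improvements of 26.7% in ADE and 17.4% in FDE, underscoring its potential for embodied intelligence.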

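🔬 Method Details

Problem definition: Existing trajectory prediction methods have two main pain points. First, they cannot adequately model the dynamics of human motion and struggle to balance local motion details with long-range spatiotemporal dependencies. Second, their time-domain representations lack interaction with the corresponding frequency components, which makes joint modeling of trajectory sequences difficult.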

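Core idea: PatchTraj uses dynamic patch partitioning to capture multi-scale motion patterns and combines it with joint time-frequency modeling to understand trajectory data more completely. Decomposing a trajectory into time-domain and frequency-domain information and then fusing the two allows the model to capture the trajectory's dynamic characteristics more effectively.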
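The split into a time view and a frequency view can be sketched as follows. The summary does not say which transform PatchTraj uses, so the real FFT along the time axis is an assumption made purely for illustration, and the function name `decompose_trajectory` is hypothetical.

```python
import numpy as np

def decompose_trajectory(traj):
    """Return the raw time sequence and an assumed frequency view.

    traj: array of shape (T, 2) holding (x, y) positions per time step.
    The real FFT along time is an illustrative assumption, not a
    confirmed detail of PatchTraj.
    """
    traj = np.asarray(traj, dtype=float)
    spectrum = np.fft.rfft(traj, axis=0)  # complex, shape (T // 2 + 1, 2)
    # Stack real and imaginary parts so the frequency branch sees real-valued features.
    freq = np.concatenate([spectrum.real, spectrum.imag], axis=-1)
    return traj, freq

# Toy observed trajectory: 8 steps of constant-velocity walking.
obs = np.stack([np.arange(8) * 0.5, np.arange(8) * 0.2], axis=-1)
time_view, freq_view = decompose_trajectory(obs)
```

Because the real and imaginary parts are both kept, the frequency view here is lossless: an inverse real FFT recovers the original positions.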

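Technical framework: The PatchTraj framework consists of the following stages: 1) Trajectory decomposition: the trajectory is decomposed into raw time sequences and frequency components. 2) Dynamic patch partitioning: the time and frequency components are each partitioned into dynamic patches, yielding a multi-scale segmentation. 3) Feature extraction and embedding: each patch undergoes scale-aware feature extraction and adaptive embedding. 4) Hierarchical feature aggregation: the extracted features are aggregated hierarchically to model fine-grained and long-range dependencies. 5) Cross-modal attention fusion: a cross-modal attention mechanism fuses the features of the time and frequency branches. 6) Trajectory prediction: a Transformer architecture produces the final trajectory prediction.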
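To make the multi-scale segmentation in stage 2 concrete, here is a minimal sketch that cuts one sequence into patches at several scales. It uses fixed patch lengths as a simplified stand-in for the dynamic partitioning described above; the function names and the chosen lengths (2, 4, 8) are hypothetical.

```python
def partition_into_patches(seq, patch_len, stride=None):
    """Split a sequence of time steps into windows of `patch_len` steps.

    With the default stride the windows do not overlap; a trailing
    remainder shorter than `patch_len` is dropped.
    """
    stride = patch_len if stride is None else stride
    return [seq[i:i + patch_len]
            for i in range(0, len(seq) - patch_len + 1, stride)]

def multi_scale_patches(seq, patch_lens=(2, 4, 8)):
    """Map each patch length to the patches obtained at that scale."""
    return {p: partition_into_patches(seq, p)
            for p in patch_lens if p <= len(seq)}

# Toy observed trajectory with 8 (x, y) points.
obs = [(0.5 * t, 0.2 * t) for t in range(8)]
scales = multi_scale_patches(obs)
```

Short patches expose local motion details, while the longest patch covers the whole observation window, which is the intuition behind aggregating features hierarchically across scales.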

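Key innovations: 1) Dynamic patch partitioning, which segments the trajectory adaptively and captures motion patterns at different scales. 2) Joint time-frequency modeling, which considers both the time-domain and frequency-domain information of a trajectory to understand its dynamics more completely. 3) Cross-modal attention fusion, which effectively fuses the features of the time and frequency branches and improves prediction accuracy.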
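The cross-modal fusion idea can be illustrated with plain scaled dot-product attention, where each branch queries the other. This is a sketch under simplifying assumptions: it is single-head, omits the learned query/key/value projections a real model would have, and the residual addition and all array sizes are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Scaled dot-product attention of `queries` (Nq, d) over `context` (Nc, d).

    Returns the attended features (Nq, d) and the attention weights (Nq, Nc).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights

rng = np.random.default_rng(0)
time_emb = rng.normal(size=(4, 16))  # hypothetical: 4 time-branch patch embeddings
freq_emb = rng.normal(size=(3, 16))  # hypothetical: 3 frequency-branch patch embeddings

# Each branch attends to the other; the residual keeps its own features.
time_enh = time_emb + cross_attention(time_emb, freq_emb)[0]
freq_enh = freq_emb + cross_attention(freq_emb, time_emb)[0]
```

Each row of the attention weights sums to 1, so every time-branch patch receives a convex combination of frequency-branch features, and vice versa.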

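Key design: The size of each dynamic patch is determined adaptively from the local variation of the trajectory; the loss function is the commonly used mean squared error (MSE) loss; the number of Transformer layers and the hidden dimension are adjusted according to dataset size.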
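One way to picture patch sizes that follow local variation is a greedy split that closes a patch once the accumulated change in velocity passes a threshold. This heuristic is hypothetical and is not the partitioning rule from the paper; the function name, `threshold`, and `max_len` are assumptions chosen for illustration.

```python
import math

def dynamic_patches(traj, threshold=0.5, max_len=4):
    """Greedily split `traj` (a list of (x, y) points) into variable-length patches.

    Hypothetical rule: close the current patch when the accumulated
    second difference (a proxy for acceleration) exceeds `threshold`,
    or when the patch already holds `max_len` points.
    """
    patches, current, accumulated = [], [traj[0]], 0.0
    for t in range(1, len(traj)):
        if t >= 2:
            ax = traj[t][0] - 2 * traj[t - 1][0] + traj[t - 2][0]
            ay = traj[t][1] - 2 * traj[t - 1][1] + traj[t - 2][1]
            accumulated += math.hypot(ax, ay)
        if accumulated > threshold or len(current) >= max_len:
            patches.append(current)
            current, accumulated = [], 0.0
        current.append(traj[t])
    patches.append(current)
    return patches

# Straight walk along x, then a sharp turn and a straight walk along y.
traj = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5)]
patches = dynamic_patches(traj)
```

In this toy input the sharp turn forces an early patch boundary, while the smooth segment after it is cut only by the length cap.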

📊 Experimental Highlights

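PatchTraj achieves SOTA performance on multiple datasets, including ETH-UCY, SDD, NBA, and JRDB. On the egocentric JRDB dataset in particular, it attains a relative improvement of 26.7% in ADE and 17.4% in FDE. These results indicate that PatchTraj models the dynamics of human motion effectively and substantially improves trajectory prediction accuracy.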
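For reference, ADE (average displacement error) is the mean Euclidean distance between predicted and ground-truth positions over all future steps, and FDE (final displacement error) is that distance at the last step. The sketch below computes both for a single prediction, plus the relative improvement formula. Many benchmarks instead report the minimum over several sampled predictions, and all numbers here are made up for illustration rather than taken from the paper.

```python
import math

def ade_fde(pred, gt):
    """Return (ADE, FDE) for one predicted trajectory against ground truth.

    pred, gt: equal-length lists of (x, y) positions over the prediction horizon.
    """
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]

def relative_improvement(baseline, ours):
    """Percentage reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline * 100.0

# Made-up example: the prediction drifts sideways from a straight path.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pred = [(1.0, 0.0), (2.0, 0.3), (3.0, 0.6)]
ade, fde = ade_fde(pred, gt)

# Hypothetical errors: a baseline at 0.40 versus a new method at 0.30.
gain = relative_improvement(0.40, 0.30)
```

Since both metrics are errors, a relative improvement means the error dropped by that percentage of the baseline's value.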

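🎯 Application Scenarios

PatchTraj has broad application prospects in autonomous driving, robot navigation, and intelligent surveillance. Accurate pedestrian trajectory prediction can improve the safety of autonomous vehicles, help robots plan paths more effectively, and give intelligent surveillance systems more reliable analysis results. The work is also significant for advancing embodied intelligence.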

📄 Abstract (Original)

Pedestrian trajectory prediction is crucial for autonomous driving and robotics. However, existing point-based and grid-based methods expose two main limitations: they insufficiently model human motion dynamics, as they fail to balance local motion details with long-range spatiotemporal dependencies, and their time representations lack interaction with the corresponding frequency components when jointly modeling trajectory sequences. To address these challenges, we propose PatchTraj, a dynamic patch-based framework that integrates time-frequency joint modeling for trajectory prediction. Specifically, we decompose the trajectory into raw time sequences and frequency components, and employ dynamic patch partitioning to perform multi-scale segmentation, capturing hierarchical motion patterns. Each patch undergoes adaptive embedding with scale-aware feature extraction, followed by hierarchical feature aggregation to model both fine-grained and long-range dependencies. The outputs of the two branches are further enhanced via cross-modal attention, facilitating complementary fusion of temporal and spectral cues. The resulting enhanced embeddings exhibit strong expressive power, enabling accurate predictions even when using a vanilla Transformer architecture. Extensive experiments on ETH-UCY, SDD, NBA, and JRDB datasets demonstrate that our method achieves state-of-the-art performance. Notably, on the egocentric JRDB dataset, PatchTraj attains significant relative improvements of 26.7% in ADE and 17.4% in FDE, underscoring its substantial potential in embodied intelligence.
