Motion-o: Trajectory-Grounded Video Reasoning
Authors: Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas
Categories: cs.CV, cs.AI
Date: 2026-03-19
🔗 Code/Project: GITHUB
💡 One-Sentence Takeaway
Proposes Motion-o, which strengthens spatio-temporal reasoning in video understanding through explicit trajectory reasoning.
🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: video understanding, motion reasoning, trajectory prediction, spatio-temporal reasoning, vision-language models
📋 Key Points
- Existing video reasoning models lack explicit modeling of object motion trajectories, which makes motion relationships between objects hard to verify and understand.
- Motion-o introduces a trajectory-grounding dataset and a Motion Chain of Thought (MCoT) to explicitly model and reason about object trajectories in video.
- Experiments show that Motion-o delivers clear gains in spatial-temporal grounding and trajectory prediction, and integrates easily into existing frameworks.
📝 Abstract (Summary)
To address the limited understanding of object motion patterns in video reasoning, this paper formalizes the concept of Spatial-Temporal-Trajectory (STT) reasoning and introduces Motion-o, a motion-centric extension to video understanding that makes trajectories explicit and verifiable. To enable motion reasoning, the paper also introduces a trajectory-grounding dataset that augments sparse keyframe supervision into denser bounding-box tracks, yielding a stronger trajectory-level training signal. It further proposes Motion Chain of Thought (MCoT), a structured reasoning pathway that uses discrete tags summarizing each object's direction, speed, and change in velocity magnitude to explicitly connect grounded observations into trajectories. Trained with a reward function that compels the model to reason directly over visual evidence, and requiring no architectural modifications, Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks.
🔬 Method Details
Problem definition: Existing video reasoning models fall short at understanding how objects move over time. Although they can recognize objects in video and establish spatio-temporal relations, they typically ignore trajectory information, leaving trajectory understanding implicit and hard to verify. This limits performance on tasks that require precise motion reasoning.
Core idea: Motion-o models object motion trajectories explicitly within the video reasoning process. Using a trajectory-grounding dataset and Motion Chain of Thought (MCoT), the model learns to reason about how objects move through a video. This explicit modeling makes trajectories verifiable and lets the model exploit motion information more effectively during inference.
Technical framework: Motion-o builds on existing vision-language models without modifying the architecture. Its main components are: 1) a trajectory-grounding dataset, which augments sparse keyframe annotations into dense bounding-box tracks; 2) Motion Chain of Thought (MCoT), a structured reasoning pathway that uses discrete tags summarizing each object's direction, speed, and change in velocity magnitude to connect grounded observations into explicit trajectories.
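The digest does not specify how the dataset artifact densifies sparse keyframe supervision. As a minimal sketch only, one plausible ingredient is linear interpolation of bounding boxes between annotated keyframes; the function name and `(x1, y1, x2, y2)` box format below are assumptions, not the paper's pipeline:

```python
# Sketch: densify sparse keyframe annotations into a per-frame track.
# Boxes are assumed to be (x1, y1, x2, y2); linear interpolation is only
# an illustration -- the paper's augmentation procedure may differ.

def interpolate_track(keyframes):
    """keyframes: dict {frame_index: (x1, y1, x2, y2)} with sparse labels.
    Returns a dict with one box for every frame from the first to last key."""
    frames = sorted(keyframes)
    track = {}
    for a, b in zip(frames, frames[1:]):
        box_a, box_b = keyframes[a], keyframes[b]
        for t in range(a, b):
            alpha = (t - a) / (b - a)  # fractional position between keyframes
            track[t] = tuple(
                (1 - alpha) * ca + alpha * cb for ca, cb in zip(box_a, box_b)
            )
    track[frames[-1]] = keyframes[frames[-1]]
    return track
```

A track densified this way gives a supervision target at every frame, which is the "stronger trajectory-level training signal" the summary refers to.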
Key innovation: Unlike prior methods that learn motion information only implicitly, Motion-o makes motion trajectories an explicit part of the reasoning process. Through the trajectory-grounding dataset and MCoT, the model directly learns and reasons over object trajectories, making them verifiable and directly usable as reasoning evidence.
Key design: The trajectory-grounding dataset uses data augmentation to generate dense bounding-box tracks from sparse keyframe annotations. MCoT uses discrete tags to record each object's direction, speed, and scale of velocity change between grounded observations, linking those observations into an explicit, verifiable trajectory.
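The exact MCoT tag format is not shown in this digest. Purely as an illustration, the per-object quantities such a tag summarizes (direction, speed, and scale change) could be derived from two consecutive boxes as follows; the function name, the 8-way direction binning, and the area-ratio definition of scale change are all assumptions:

```python
import math

def motion_summary(box_t, box_t1, dt=1.0):
    """Summarize motion between two (x1, y1, x2, y2) boxes at consecutive
    timesteps. Returns (direction, speed, scale_change) -- an illustrative
    stand-in for the discrete fields an MCoT tag might carry."""
    def center(b):
        return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

    def area(b):
        return max(b[2] - b[0], 0) * max(b[3] - b[1], 0)

    (cx0, cy0), (cx1, cy1) = center(box_t), center(box_t1)
    dx, dy = cx1 - cx0, cy1 - cy0
    speed = math.hypot(dx, dy) / dt
    # Coarse 8-way direction from the displacement angle (image coords: y down).
    dirs = ["right", "down-right", "down", "down-left",
            "left", "up-left", "up", "up-right"]
    angle = math.degrees(math.atan2(dy, dx)) % 360
    direction = "static" if speed < 1e-6 else dirs[int((angle + 22.5) // 45) % 8]
    # Scale change as the ratio of box areas (proxy for approach/recede).
    scale_change = area(box_t1) / area(box_t) if area(box_t) else float("nan")
    return direction, speed, scale_change
```

Discretizing motion into a few symbolic fields like this is what makes each step of the chain checkable against the grounded boxes.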
📊 Experimental Highlights
Motion-o achieves clear gains on spatial-temporal grounding and trajectory prediction tasks. The results show that it exploits motion information effectively for reasoning and outperforms existing methods on multiple benchmarks; specific performance numbers are reported in the paper.
🎯 Application Scenarios
Motion-o applies to video understanding tasks that demand precise motion reasoning, such as video surveillance, autonomous driving, robot navigation, and motion analysis. By explicitly modeling object trajectories, it can improve video understanding and analysis in these settings, enabling more intelligent and reliable systems.
📄 Abstract (Original)
Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about \emph{how} objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories through discrete \texttt{} tag summarizing per-object direction, speed, and scale (of velocity) change to explicitly connect grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.
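The abstract mentions a reward function that compels reasoning over visual evidence but gives no formula. As a hedged sketch only, a trajectory-grounded reward in that spirit might combine box overlap with a check that structured motion tags were emitted; every name and constant below is an assumption, not the paper's definition:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    def area(r):
        return max(r[2] - r[0], 0) * max(r[3] - r[1], 0)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def trajectory_reward(pred_track, gt_track, has_motion_tags):
    """Illustrative reward: mean IoU over frames where both tracks have a
    box, plus a small format bonus if structured motion tags were emitted."""
    shared = set(pred_track) & set(gt_track)
    if not shared:
        return 0.0
    grounding = sum(iou(pred_track[t], gt_track[t]) for t in shared) / len(shared)
    return grounding + (0.1 if has_motion_tags else 0.0)
```

Tying the reward to per-frame overlap is what forces the model's stated trajectory to match the visual evidence rather than free-form text.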