Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

📄 arXiv: 2605.21988v1

Authors: Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

Categories: cs.CV, cs.AI

Published: 2026-05-21

Notes: Project website: https://ddz16.github.io/crpo.github.io/

🔗 Code/Project: https://ddz16.github.io/crpo.github.io/


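💡 One-sentence takeaway

Proposes CRPO, a counterfactual dual-branch RL framework that improves the spatiotemporal sensitivity of Video LLMs.

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 8: Physics-based Animation; Pillar 9: Embodied Foundation Models

Keywords: video understanding, counterfactual learning, reinforcement learning, spatiotemporal dynamics, multimodal learning, model optimization, Video LLMs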

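📋 Key points

  1. Existing Video LLMs often answer questions through shortcuts such as single-frame cues and language priors instead of tracking spatiotemporal dynamics, and correctness-only rewards in RL post-training can reinforce these shortcuts.
  2. The paper proposes the CRPO framework, which constructs counterfactual videos and introduces a Counterfactual Relation Reward (CRR) to make the model more sensitive to spatiotemporal dynamics.
  3. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model.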

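📝 Summary

Video large language models (Video LLMs) score well on benchmarks, yet they often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. To address this, the paper proposes Counterfactual Relational Policy Optimization (CRPO), which constructs counterfactual videos and introduces a Counterfactual Relation Reward (CRR) so that the model's answers respond to changes in video dynamics. Experiments show that CRPO outperforms prior reinforcement learning methods on spatiotemporal-sensitive evaluations and improves DyBench pair accuracy while maintaining competitive general video performance.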

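🔬 Method details

Problem definition: The paper targets the failure of Video LLMs to track spatiotemporal dynamics when answering questions. Existing models tend to rely on single-frame cues and language priors, so they perform poorly on questions whose answers depend on motion or event order.

Core idea: CRPO uses counterfactual videos together with a Counterfactual Relation Reward (CRR). The reward encourages the answer to change between the original and counterfactual video for dynamic questions and to stay the same for static questions, which strengthens spatiotemporal sensitivity.

Technical framework: CRPO is a dual-branch reinforcement learning framework that trains on both an original-video branch and a counterfactual-video branch. Counterfactual videos are generated by horizontal flips and temporal reversals, and CRR is applied between the answers of the two branches.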
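The two counterfactual transforms named above can be illustrated with a minimal sketch. It assumes a video is a list of frames and each frame is a grid of pixel values; the type aliases and function names are hypothetical and are not taken from the paper's code.

```python
from typing import List

# Assumed toy representation: a frame is an H x W grid of grayscale values,
# and a video is a list of frames ordered in time.
Frame = List[List[int]]
Video = List[Frame]


def horizontal_flip(video: Video) -> Video:
    """Mirror every frame left-to-right; frame order is unchanged."""
    return [[row[::-1] for row in frame] for frame in video]


def temporal_reverse(video: Video) -> Video:
    """Play the frames in reverse order; each frame is unchanged."""
    return video[::-1]


# Two 1x3 frames: a bright pixel moves from the left edge to the centre.
video = [[[9, 0, 0]], [[0, 9, 0]]]
flipped = horizontal_flip(video)    # the pixel now moves from the right edge
reversed_ = temporal_reverse(video)  # the pixel now moves centre to left
```

A horizontal flip changes the answer to a "which direction" question but not to "what object is this", while a temporal reversal changes the answer to an event-order question. That difference is what lets the same transform serve both dynamic and static questions.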

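Key innovation: The core innovation of CRPO is the Counterfactual Relation Reward (CRR). Because the reward constrains the relation between the two branches' answers, a shortcut policy has difficulty being rewarded consistently on both branches, which pushes the model toward genuine spatiotemporal sensitivity.

Key design: During training, CRR requires that answers to dynamic questions change across the two branches and that answers to static questions remain unchanged. The paper also introduces DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, which provides a strict test of spatiotemporal sensitivity.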
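The relation that CRR enforces can be sketched as follows. This is an illustrative reading only: the binary 0/1 reward, the normalized string comparison, and the function name are assumptions, and the paper's actual reward formulation may differ.

```python
def counterfactual_relation_reward(
    answer_orig: str, answer_cf: str, is_dynamic: bool
) -> float:
    """Hypothetical CRR sketch.

    Returns 1.0 when the relation between the two branch answers matches
    the question type: answers differ for a dynamic question, or agree
    for a static question. Returns 0.0 otherwise.
    """
    changed = answer_orig.strip().lower() != answer_cf.strip().lower()
    return 1.0 if changed == is_dynamic else 0.0


# A policy that always answers "left" regardless of the video earns no
# relation reward on a dynamic (direction) question ...
fixed_dynamic = counterfactual_relation_reward("left", "left", is_dynamic=True)
# ... while a policy whose answer follows the flipped video does.
tracking_dynamic = counterfactual_relation_reward("left", "right", is_dynamic=True)
# On a static question the answer should be stable across branches.
stable_static = counterfactual_relation_reward("a dog", "A dog", is_dynamic=False)
```

In this sketch the relation term checks only whether the answers change, not whether they are correct, so it would be combined with a per-branch correctness reward in practice; how the paper weights the two is not specified here.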

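📊 Experimental highlights

On Qwen3-VL-8B, CRPO improves pair accuracy (P-Acc) on DyBench by +7.7 and I-Acc on TimeBlind by +8.2 over the base model. These gains indicate improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts, and general video performance remains competitive.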
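To see why a strict pair-level metric resists fixed-answer shortcuts, consider the sketch below. It assumes a pair counts as correct only when both the original and the counterfactual video are answered correctly; this definition and the function names are assumptions for illustration, not the benchmark's published scoring code.

```python
from typing import List, Tuple

# Each item: (pred_orig, gold_orig, pred_cf, gold_cf)
Pair = Tuple[str, str, str, str]


def pair_accuracy(pairs: List[Pair]) -> float:
    """Fraction of pairs where BOTH predictions match their gold answers."""
    if not pairs:
        return 0.0
    hits = sum(1 for po, go, pc, gc in pairs if po == go and pc == gc)
    return hits / len(pairs)


def sample_accuracy(pairs: List[Pair]) -> float:
    """Ordinary per-video accuracy over the same data, for comparison."""
    if not pairs:
        return 0.0
    hits = sum((po == go) + (pc == gc) for po, go, pc, gc in pairs)
    return hits / (2 * len(pairs))


# A model that always answers "left" on flipped direction pairs:
always_left = [
    ("left", "left", "left", "right"),
    ("left", "right", "left", "left"),
]
per_video = sample_accuracy(always_left)  # half of the videos are "correct"
per_pair = pair_accuracy(always_left)     # no pair is fully correct
```

Under this assumed definition the fixed-answer model reaches 0.5 per-video accuracy but 0.0 pair accuracy, which is the kind of inflation the strict metric is meant to prevent.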

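🎯 Application scenarios

Potential application areas include video understanding, intelligent surveillance, and autonomous driving, where the approach could improve a model's decisions in complex dynamic environments. The method may also help advance video analysis and broader multimodal learning.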

📄 Abstract (original)

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/.