Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

Authors: Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

Categories: cs.CV, cs.AI

Published: 2026-05-21

Note: Project website: https://ddz16.github.io/crpo.github.io/

🔗 Code/Project: PROJECT_PAGE

💡 One-Sentence Summary

Proposes CRPO to improve the spatiotemporal sensitivity of Video LLMs.

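🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 8: Physics-based Animation; Pillar 9: Embodied Foundation Models

Keywords: video understanding, counterfactual learning, reinforcement learning, spatiotemporal dynamics, multimodal learning, model optimization, Video LLMs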

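📋 Key Points

Existing Video LLMs often rely on shortcuts when answering questions and fail to capture spatiotemporal dynamics, which limits their performance on dynamics-dependent questions.
This paper proposes the CRPO framework, which constructs counterfactual videos and introduces a Counterfactual Relation Reward (CRR) to make the model more sensitive to spatiotemporal dynamics.
On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, supporting its effectiveness.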

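📝 Abstract (Translated Summary)

Video large language models (Video LLMs) perform well on benchmarks, but often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. To address this, the paper proposes Counterfactual Relational Policy Optimization (CRPO), which constructs counterfactual videos and introduces a Counterfactual Relation Reward (CRR) to make the model more sensitive to dynamic questions. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations and improves DyBench accuracy while maintaining competitive general video performance.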

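🔬 Method Details

Problem definition: The paper targets the failure of Video LLMs to track spatiotemporal dynamics when answering questions. Existing methods tend to rely on single-frame cues and language priors, so models underperform on questions about dynamics.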

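Core idea: CRPO introduces counterfactual videos and a Counterfactual Relation Reward (CRR) that encourages the model's answer to change for dynamic questions and to stay the same for static questions, which strengthens spatiotemporal sensitivity.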

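Technical framework: CRPO uses a dual-branch reinforcement learning framework that trains on both the original video and its counterfactual. Counterfactual videos are generated by horizontal flips and temporal reversals, and the CRR is applied between the answers of the two branches.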
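The two counterfactual transforms named above can be sketched in a few lines. This is a minimal illustration rather than the authors' pipeline: the nested-list video layout and the function names `temporal_reverse` and `horizontal_flip` are assumptions made for this example. With a NumPy array of shape (T, H, W) the same operations would be `video[::-1]` and `video[:, :, ::-1]`.

```python
from typing import List

Frame = List[List[int]]  # assumed layout: rows of pixel values
Video = List[Frame]      # assumed layout: frames in temporal order


def temporal_reverse(video: Video) -> Video:
    """Play the clip backwards by reversing the frame order."""
    return list(reversed(video))


def horizontal_flip(video: Video) -> Video:
    """Mirror every frame left-to-right by reversing each pixel row."""
    return [[row[::-1] for row in frame] for frame in video]


# Toy clip: a single bright pixel moving from left to right across 1x3 frames.
clip: Video = [
    [[1, 0, 0]],
    [[0, 1, 0]],
    [[0, 0, 1]],
]

reversed_clip = temporal_reverse(clip)  # the pixel now appears to move right to left
flipped_clip = horizontal_flip(clip)    # mirrored: the pixel also moves right to left
```

In this toy clip both transforms reverse the apparent direction of motion, so the correct answer to "which way does the object move?" should change, while the answer to a static question such as "how many objects are visible?" should not.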

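Key innovation: The central contribution is the CRR. Because it constrains the relation between the two branches' answers, a shortcut policy is unlikely to be rewarded consistently on both branches, which improves spatiotemporal sensitivity.

Key design: During training, CRR requires answers to dynamic questions to change between the original and counterfactual videos, and answers to static questions to remain unchanged. The paper also introduces DyBench, a paired counterfactual benchmark of 3,014 videos covering reversible dynamics, moving direction, and event sequence, as a strict test of spatiotemporal sensitivity.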
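One way to read the CRR rule is as a check on whether the pair of answers changed when it should have. The sketch below is an assumption-laden illustration, not the paper's formulation: the binary 0/1 reward, the string normalization, and the names `normalize` and `counterfactual_relation_reward` are all hypothetical, and the summary does not state how CRR is scaled or combined with a correctness reward.

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string before comparison (assumed preprocessing)."""
    return answer.strip().lower()


def counterfactual_relation_reward(answer_orig: str, answer_cf: str, is_dynamic: bool) -> float:
    """Return 1.0 when the answer pair follows the expected relation, else 0.0.

    Dynamic question: the answers on the original and counterfactual videos should differ.
    Static question: the answers should be the same.
    The binary reward value is an assumption made for this sketch.
    """
    changed = normalize(answer_orig) != normalize(answer_cf)
    return 1.0 if changed == is_dynamic else 0.0


# A fixed-answer shortcut policy says "left" for both videos of a dynamic question.
shortcut_dynamic = counterfactual_relation_reward("left", "left", is_dynamic=True)
# A policy that tracks motion changes its answer on the counterfactual video.
tracking_dynamic = counterfactual_relation_reward("left", "right", is_dynamic=True)
# A static question (for example, object color) should keep the same answer.
static_stable = counterfactual_relation_reward("Red", "red ", is_dynamic=False)

print(shortcut_dynamic, tracking_dynamic, static_stable)  # 0.0 1.0 1.0
```

Under this reading, a policy that ignores the video cannot collect the relation reward on dynamic questions, which matches the stated intent of making shortcut policies hard to reward consistently across both branches.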

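📊 Experimental Highlights

Experiments show that, on Qwen3-VL-8B, CRPO improves DyBench pair accuracy (P-Acc) by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity while remaining competitive on general video performance.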
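To see why a pair-level metric resists fixed-answer shortcuts, it may help to compare it with ordinary per-video accuracy. The sketch below assumes that a pair counts as correct only when both the original and the counterfactual video are answered correctly; the exact DyBench scoring rule is not given in this summary, and the function names and toy data are hypothetical.

```python
from typing import List, Tuple


def per_video_accuracy(results: List[Tuple[bool, bool]]) -> float:
    """Fraction of individual videos answered correctly, ignoring pairing."""
    flat = [correct for pair in results for correct in pair]
    return sum(flat) / len(flat) if flat else 0.0


def pair_accuracy(results: List[Tuple[bool, bool]]) -> float:
    """Fraction of pairs where BOTH videos are answered correctly (assumed rule)."""
    if not results:
        return 0.0
    return sum(1 for orig_ok, cf_ok in results if orig_ok and cf_ok) / len(results)


# Toy gold answers for (original, counterfactual) videos of three dynamic questions.
gold_pairs = [("left", "right"), ("opening", "closing"), ("up", "down")]
# A shortcut policy that gives the same answer to both videos in each pair.
fixed_answers = ["left", "opening", "up"]

results = [
    (answer == gold_orig, answer == gold_cf)
    for answer, (gold_orig, gold_cf) in zip(fixed_answers, gold_pairs)
]

print(per_video_accuracy(results))  # 0.5
print(pair_accuracy(results))       # 0.0
```

In this toy case the shortcut policy looks half right under per-video accuracy but scores zero under the pair-level rule, because it never gets both members of a pair correct.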

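🎯 Application Scenarios

Potential application areas include video understanding, intelligent surveillance, and autonomous driving, where the method could improve model decisions in complex dynamic environments. It may also advance video analysis and broaden the use of multimodal learning.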

📄 Abstract (Original)

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/.
