DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

📄 arXiv: 2605.26680v1 📥 PDF

Authors: Peng Zhang, Guanghao Zhang, Wanggui He, Longxiang Zhang, Mushui Liu, Yan Xia, Zhenhao Peng, Weilong Dai, Jinlong Liu, Haobing Tang, Le Zhang, Hao Jiang, Pipei Huang

Categories: cs.CV, cs.AI

Published: 2026-05-26

🔗 Code/Project: https://github.com/zhangguanghao523/DynFrame


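💡 One-Sentence Takeaway

Proposes DynFrame, which makes frame sampling density a learnable decision for complex video understanding.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: long video understanding, dynamic sampling, multimodal fusion, autoregressive models, video retrieval, deep learning, intelligent reasoning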

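📋 Key Points

  1. Existing thinking-with-video systems have two structural gaps: sampling density is not a learnable decision, and retrieval and answering are optimized with a single shared advantage. Fine-grained evidence therefore requires repeated retrieval calls, which lengthens the inference context and makes training harder.
  2. DynFrame emits the temporal window and the sampling density as native tokens, so multi-granularity evidence can be acquired in a single retrieval step.
  3. DynFrame-8B sets a new state of the art on most metrics across six benchmarks, and DynFrame-4B is competitive with strong 7B-8B baselines.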

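📝 Abstract (Translated)

Recent video multimodal large language models (MLLMs) increasingly combine step-by-step reasoning with on-demand visual evidence retrieval, letting the model revisit relevant video segments during inference. Existing thinking-with-video systems nevertheless have two structural gaps. First, sampling density is not a learnable decision, so recovering fine-grained evidence requires repeated retrieval, which increases inference context length and training difficulty. Second, retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit. To address this, the paper proposes DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass, enabling multi-granularity evidence to be acquired in one retrieval step. Building on this tokenized retrieval interface, it introduces Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, crediting the sampling decision and the answer separately. Trained on DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines on six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets a new state of the art on most metrics.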

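🔬 Method Details

Problem definition: The paper targets two shortcomings of existing video understanding systems, in sampling density and in retrieval optimization. In particular, a largely fixed per-window frame rate forces repeated retrieval calls to recover fine-grained evidence, which lengthens the inference context and makes training harder.

Core idea: DynFrame emits both the temporal window and the sampling density as learnable tokens, so the model can acquire multi-granularity evidence within a single autoregressive pass rather than through repeated retrieval.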
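To make the span-density idea concrete, here is a minimal sketch of how a decoded retrieval request could be turned into frame timestamps. The textual request format `<retrieve start=… end=… fps=…>`, the names `parse_request` and `frame_timestamps`, and the `max_frames` cap are all hypothetical; this summary does not specify DynFrame's actual token vocabulary or sampler.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical textual form of a retrieval request. The real DynFrame
# tokens are not described in this summary.
REQUEST_PATTERN = re.compile(
    r"<retrieve start=(?P<start>\d+(?:\.\d+)?) "
    r"end=(?P<end>\d+(?:\.\d+)?) "
    r"fps=(?P<fps>\d+(?:\.\d+)?)>"
)


@dataclass(frozen=True)
class RetrievalRequest:
    start: float  # window start, in seconds
    end: float    # window end, in seconds
    fps: float    # sampling density chosen by the model


def parse_request(text: str) -> Optional[RetrievalRequest]:
    """Extract the first well-formed request from generated text, if any."""
    match = REQUEST_PATTERN.search(text)
    if match is None:
        return None
    req = RetrievalRequest(
        float(match["start"]), float(match["end"]), float(match["fps"])
    )
    # Reject empty windows and non-positive densities.
    if req.end <= req.start or req.fps <= 0:
        return None
    return req


def frame_timestamps(req: RetrievalRequest, max_frames: int = 64) -> List[float]:
    """Uniformly sample timestamps in [start, end] at the requested density."""
    step = 1.0 / req.fps
    # Small epsilon guards against float error just below an integer.
    count = int((req.end - req.start) * req.fps + 1e-9) + 1
    count = min(count, max_frames)  # assumed cap to bound context length
    return [round(req.start + i * step, 3) for i in range(count)]


if __name__ == "__main__":
    generated = "The action is brief. <retrieve start=12.0 end=14.0 fps=2>"
    request = parse_request(generated)
    if request is not None:
        print(frame_timestamps(request))
```

Because the window and the density arrive in the same request, a short window can be sampled densely and a long one sparsely without issuing a second call.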

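Technical framework: DynFrame has two main components, a tokenized retrieval interface and Segment-Decoupled GRPO (SD-GRPO). The interface emits the temporal window and the sampling density in a single retrieval step. SD-GRPO splits each rollout at the retrieval boundary and assigns role-specific token-level advantages.

Key innovation: Learnable sampling-density and temporal-window tokens let the model acquire multi-granularity evidence in one retrieval step instead of through repeated calls, which the authors identify as a source of longer inference context and harder training.

Key design: During training, the sampling decision and the answer are credited separately through role-specific token-level advantages rather than one trajectory-level advantage. The "where to look" tokens and the "how to answer" tokens are thus each credited for their own contribution, even when one is correct and the other is not.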
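The credit-assignment idea can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: it assumes each rollout carries one scalar retrieval reward and one scalar answer reward, normalizes each within the rollout group in the usual GRPO manner, and broadcasts the result to the tokens on either side of the retrieval boundary. The function names, the reward definitions, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_normalize(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: center and scale rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_decoupled_advantages(
    token_counts: Sequence[int],
    boundaries: Sequence[int],
    retrieval_rewards: Sequence[float],
    answer_rewards: Sequence[float],
) -> List[List[float]]:
    """Return one advantage per token for every rollout in the group.

    Tokens before the boundary (reasoning plus the retrieval request) get the
    retrieval advantage; tokens from the boundary onward get the answer
    advantage.
    """
    adv_retrieval = group_normalize(retrieval_rewards)
    adv_answer = group_normalize(answer_rewards)
    per_token: List[List[float]] = []
    for n, b, a_ret, a_ans in zip(
        token_counts, boundaries, adv_retrieval, adv_answer
    ):
        if not 0 <= b <= n:
            raise ValueError("boundary must lie within the rollout")
        per_token.append([a_ret] * b + [a_ans] * (n - b))
    return per_token


if __name__ == "__main__":
    # Two rollouts: the first retrieves well but answers wrongly,
    # the second retrieves poorly but answers correctly.
    advantages = segment_decoupled_advantages(
        token_counts=[4, 3],
        boundaries=[2, 1],
        retrieval_rewards=[1.0, 0.0],
        answer_rewards=[0.0, 1.0],
    )
    for row in advantages:
        print([round(a, 3) for a in row])
```

With a single trajectory-level advantage, both segments of a rollout would share one sign. In this sketch the first rollout's retrieval tokens are reinforced while its answer tokens are penalized, which is the separation the abstract describes.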

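📊 Experimental Highlights

Across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), DynFrame-4B is competitive with strong 7B-8B baselines, and DynFrame-8B sets a new state of the art on most metrics.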

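🎯 Application Scenarios

The work is potentially applicable to video understanding, intelligent surveillance, autonomous driving, and human-computer interaction. By making video analysis more efficient and accurate, DynFrame could offer more reliable support for real-time decision-making in these settings.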

📄 Abstract (Original)

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets new state-of-the-art on most metrics. Code is available at https://github.com/zhangguanghao523/DynFrame.