VISD: Enhancing Video Reasoning via Structured Self-Distillation
Authors: Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin
Categories: cs.CV, cs.AI
Published: 2026-05-07 (updated: 2026-05-08)
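💡 One-Sentence Summary
Proposes VISD, a structured self-distillation framework that uses multi-dimensional diagnostic feedback to improve the reasoning ability and training efficiency of VideoLLMs.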
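🎯 Matched Area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)
Keywords: VideoLLMs, structured self-distillation, reinforcement learning, spatio-temporal reasoning, multimodal learning, credit assignment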
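📋 Key Points
- Training existing VideoLLMs is hampered by sparse sequence-level rewards and by the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories, which makes complex reasoning hard to learn.
- VISD uses a video-aware judge model to provide multi-dimensional structured feedback, and a direction-magnitude decoupling mechanism to combine reinforcement learning with dense supervision stably.
- Experiments show that VISD improves answer accuracy and spatio-temporal grounding on multiple video reasoning benchmarks, with nearly 2x faster convergence (in optimization steps) than the baselines.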
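📝 Abstract (Translated from Chinese)
Training video large language models (VideoLLMs) for complex reasoning is challenging because sequence-level rewards are sparse and fine-grained credit assignment is missing over long, temporally grounded reasoning trajectories. Reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, but it does not capture token-level contributions, so learning is inefficient. Existing self-distillation methods provide dense supervision, but they lack structured diagnostic ability and are often unstable when combined with reinforcement learning. To address this, the paper proposes VISD, a structured self-distillation framework that introduces diagnostic privileged information. VISD uses a video-aware judge model to decompose reasoning quality into dimensions such as answer correctness, logical consistency, and spatio-temporal grounding, and uses this feedback to guide a teacher policy that provides token-level supervision. Through a direction-magnitude decoupling mechanism, VISD integrates reinforcement learning and dense supervision stably, improving both reasoning faithfulness and training efficiency. Experiments show that VISD outperforms strong baselines on multiple benchmarks and converges nearly 2x faster.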
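🔬 Method Details
Problem definition: In long-horizon reasoning tasks, VideoLLMs receive no fine-grained assessment of how much each token contributes to the reasoning process, so they struggle to learn an effective chain of logic from a sparse final reward. Existing self-distillation methods are also unstable when combined with reinforcement learning.
Core idea: Introduce structured privileged information that decomposes reasoning quality into interpretable dimensions (such as logical consistency and spatio-temporal grounding), use this multi-dimensional feedback to guide learning, and design a decoupling mechanism in which reinforcement learning sets the update direction while self-distillation sets the token-level update magnitude.
Framework: VISD consists of a video-aware judge model that generates structured feedback, a teacher policy that provides token-level supervision, and a training pipeline that integrates reinforcement learning with self-distillation. Rollout-level advantages are computed from rewards and combined with the structured signals in the gradient update.
Key innovation: The central contribution is the direction-magnitude decoupling mechanism. Rollout-level advantages computed from rewards determine the update direction, while structured privileged signals modulate the token-level update magnitudes. This yields semantically aligned, fine-grained credit assignment.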
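The decoupling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the group-normalised advantage, the use of the teacher-student log-probability gap as the magnitude signal, and the names `rollout_advantages` and `token_weights` are all assumptions made for illustration. The only property taken from the summary is that the advantage fixes the sign of every token's update, while a non-negative per-token signal rescales it.

```python
import math
from typing import List


def rollout_advantages(rewards: List[float]) -> List[float]:
    """One advantage per rollout, normalised within the group.

    Group normalisation is an assumption here; the source only says that
    advantages are computed from rewards at the rollout level.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-6) for r in rewards]


def token_weights(
    advantage: float,
    teacher_logp: List[float],
    student_logp: List[float],
) -> List[float]:
    """Per-token weights for one rollout.

    Direction: the sign of `advantage`, shared by every token.
    Magnitude: a non-negative per-token factor. As a stand-in for the
    structured privileged signal, this sketch uses the absolute
    teacher-student log-probability gap (hypothetical choice).
    """
    if not teacher_logp:
        return []
    gaps = [abs(t - s) for t, s in zip(teacher_logp, student_logp)]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0.0:
        # Teacher and student agree everywhere: fall back to uniform weights.
        magnitudes = [1.0] * len(gaps)
    else:
        # Normalise to mean 1 so the rollout's average weight stays equal
        # to its advantage.
        magnitudes = [g / mean_gap for g in gaps]
    # Magnitudes are >= 0, so they can never flip the direction set by RL.
    return [advantage * m for m in magnitudes]


if __name__ == "__main__":
    advs = rollout_advantages([1.0, 0.0, 0.0, 1.0])
    weights = token_weights(advs[1], [-0.1, -2.0, -0.5], [-0.3, -0.5, -0.5])
    print(advs[1], weights)
```

In this sketch a token on which the teacher and student disagree strongly receives a larger share of the rollout's update, but a rollout with a negative advantage never produces a positive token weight.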
Key design: VISD adds curriculum scheduling and teacher stabilization based on an exponential moving average (EMA) to keep optimization robust over long video sequences.
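The summary does not say how the EMA teacher is updated. The standard form of an EMA update is sketched below as an assumption: parameters are represented as a plain dictionary of floats, and `ema_update` and `decay` are hypothetical names and values rather than details from the paper.

```python
from typing import Dict


def ema_update(
    teacher: Dict[str, float],
    student: Dict[str, float],
    decay: float = 0.99,
) -> Dict[str, float]:
    """Return new teacher parameters as an exponential moving average.

    teacher_new = decay * teacher_old + (1 - decay) * student
    A decay close to 1 makes the teacher change slowly, which is the
    usual reason for using an EMA teacher as a stable supervision target.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


if __name__ == "__main__":
    teacher = {"w": 1.0}
    student = {"w": 0.0}
    for step in range(3):
        teacher = ema_update(teacher, student, decay=0.9)
        print(step, teacher["w"])
```

With a fixed student, each call moves the teacher a fraction `1 - decay` of the remaining distance toward the student, so the teacher lags behind and smooths out abrupt student changes.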
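📊 Experimental Highlights
VISD outperforms strong baselines on multiple video reasoning benchmarks, improving both answer accuracy and spatio-temporal grounding quality, and it reaches these gains with nearly 2x faster convergence in optimization steps. The results indicate that structured self-distillation improves both sample efficiency and model performance on long-horizon reasoning tasks with complex logic.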
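🎯 Application Scenarios
The work applies mainly to video question answering, complex action recognition, and long-video understanding. In autonomous driving, it could be used to analyze the behavior of traffic participants; in video surveillance and intelligent analytics, it could improve spatio-temporal reasoning about anomalous events. Its efficient training paradigm is also a useful reference for fine-tuning multimodal large models in resource-constrained settings.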
📄 Abstract (Original)
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.