Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

📄 arXiv: 2512.24271v1 📥 PDF

Authors: Zhe Huang, Hao Wen, Aiming Hao, Bingze Song, Meiqi Wu, Jiahong Wu, Xiangxiang Chu, Sheng Lu, Haoqian Wang

Categories: cs.CV, cs.AI

Published: 2025-12-30

Comments: 18 pages


💡 One-Sentence Takeaway

Proposes the DualityForge framework to address hallucinations in video understanding by multimodal large language models (MLLMs).

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal large language models, video understanding, counterfactual data synthesis, contrastive training, reinforcement learning, video editing, QA generation, hallucination

📋 Key Points

  1. Existing MLLMs are prone to visually ungrounded hallucinations when processing counterfactual videos, largely because they over-rely on language priors.
  2. This paper proposes the DualityForge framework, which generates counterfactual videos via diffusion-based video editing and couples this with a QA generation process to automatically create high-quality training data.
  3. Experiments show that DualityForge substantially reduces model hallucinations on counterfactual videos, with a relative improvement of 24.0%, and also performs strongly on other benchmarks.

📝 Abstract (Translated)

Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding, but they suffer from a critical weakness: over-reliance on language priors, which leads to visually ungrounded hallucinations when processing counterfactual videos. To address this, the paper proposes DualityForge, a novel counterfactual data synthesis framework that transforms real-world videos into counterfactual scenarios via controllable, diffusion-based video editing. The framework automatically generates high-quality QA pairs together with original-edited video pairs for contrastive training. Experiments show that the method substantially reduces model hallucinations, especially on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline.

🔬 Method Details

Problem definition: the paper targets the hallucinations MLLMs produce when processing counterfactual videos. Existing methods struggle with this challenge because of the intrinsic data imbalance between text and video and the high cost of collecting and annotating counterfactual data.

Core idea: the DualityForge framework transforms real-world videos into counterfactual scenarios via controllable, diffusion-based video editing, generating high-quality training data that reduces model hallucinations.

Technical framework: DualityForge consists of two main modules, a video editing module and a QA generation module. The video editing module produces counterfactual videos, while the QA generation module automatically generates QA pairs grounded in the edited videos.

Key innovation: structured contextual information is embedded into both the video editing and QA generation processes, so that high-quality QA pairs are produced automatically, forming a contrastive training dataset.

Key design: training uses Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL regime over the contrastive paired data, where the RL phase applies pair-wise $\ell_1$ advantage normalization for more stable policy optimization.
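The paired output of these two modules can be pictured as one training record per source clip. The sketch below is purely illustrative; the field names and types are assumptions, not the paper's actual data schema:

```python
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    """Hypothetical container for one DualityForge-style training sample:
    a real video, its diffusion-edited counterfactual, and QA pairs
    grounded in each version (names are illustrative, not the paper's API)."""
    original_video: str                  # path to the real-world source clip
    edited_video: str                    # path to the counterfactual edit
    edit_instruction: str                # structured context driving the edit
    qa_original: list[tuple[str, str]]   # (question, answer) for the original
    qa_edited: list[tuple[str, str]]     # (question, answer) for the edit
```

Keeping the two QA lists aligned question-for-question is what makes the data contrastive: the same question has a different grounded answer depending on which version of the video the model is shown.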
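As a rough sketch of what pair-wise $\ell_1$ advantage normalization could look like: the minimal version below pools rollout rewards from an original/edited video pair, centers them on the pair mean, and scales by the mean absolute deviation (an $\ell_1$ scale) instead of the usual standard deviation. The function name and details are assumptions, not the paper's exact formulation:

```python
def pairwise_l1_advantage(rewards_orig, rewards_edit, eps=1e-8):
    """Illustrative sketch of pair-wise l1 advantage normalization.

    Rewards from rollouts on an original video and its counterfactual
    edit are pooled, centered on the pair's mean, and scaled by the
    mean absolute deviation rather than the standard deviation."""
    rewards = list(rewards_orig) + list(rewards_edit)
    baseline = sum(rewards) / len(rewards)          # pair-level baseline
    centered = [r - baseline for r in rewards]
    # l1 normalizer: mean absolute deviation over the whole pair
    scale = sum(abs(c) for c in centered) / len(centered) + eps
    adv = [c / scale for c in centered]
    n = len(rewards_orig)
    return adv[:n], adv[n:]                         # advantages per video
```

Normalizing over the pair (rather than each video separately) lets rewards on the original and the counterfactual share one baseline, so the policy is pushed toward whichever answers are grounded in the video actually shown.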

📊 Experimental Highlights

On DualityVidQA-Test, DualityForge substantially reduces model hallucinations, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. The method also achieves significant gains on other hallucination and general-purpose benchmarks, indicating strong generalization capability.

🎯 Applications

Potential application areas include video content generation, intelligent question answering, and multimodal interaction. By reducing model hallucinations and improving the accuracy and reliability of video understanding, the approach could see broad use in education, entertainment, information retrieval, and other industries.

📄 Abstract (Original)

Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.