Beyond Raw Videos: Understanding Edited Videos with Large Multimodal Model

Authors: Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen

Category: cs.CV

Published: 2024-06-15 (updated: 2024-09-27)

🔗 Code/Project: GitHub (https://github.com/XenonLamb/EditVid-QA)

💡 One-Sentence Summary

Proposes the EditVid-QA benchmark for evaluating how well large multimodal models understand edited videos from social media.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: video understanding, multimodal models, edited videos, VQA benchmark, social media, domain generalization, GPT-4 evaluation

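📋 Key Points

Existing video LMM benchmarks focus on raw videos and overlook the edited videos that are abundant on social media, so models perform poorly when asked to understand such videos.
The paper builds the EditVid-QA benchmark, which covers four editing categories (effect, funny, meme, and game), to evaluate how well models understand edited videos.
Training on a mixed dataset of raw and edited videos improves LMM performance on the EditVid-QA benchmark, indicating the effectiveness of high-quality training data.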

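📝 Abstract (Translated)

Existing video large multimodal models (LMMs) have made significant progress on generic video understanding, mainly for raw videos captured by cameras. In real-world applications, however, a large share of videos are edited: users typically cut the raw video and add effects before publishing it on social media platforms. These edited videos usually have high view counts, yet existing video LMM benchmarks such as ActivityNet-QA or the VideoChatGPT benchmark do not cover them. This paper uses edited videos from TikTok to build a video VQA benchmark (named EditVid-QA) covering four typical editing categories: effect, funny, meme, and game. Funny and meme videos require nuanced understanding and high-level reasoning, while effect and game videos evaluate the understanding of artificial design. Most open-source video LMMs perform poorly on the EditVid-QA benchmark, indicating a large domain gap between edited short videos on social media and regular raw videos. To improve the generalization ability of LMMs, the authors collect a training set for the proposed benchmark based on Panda-70M/WebVid raw videos and a small number of TikTok/CapCut edited videos, which improves performance on the EditVid-QA benchmark and indicates the effectiveness of high-quality training data. The authors also find a serious problem in the existing evaluation protocol that uses GPT-3.5 as the judge, namely the "sorry" attack: a naive sorry-style answer can obtain an extremely high rating from the GPT judge, for example a correctness score above 4.3 under the VideoChatGPT evaluation protocol. To avoid "sorry" attacks, results are evaluated with a GPT-4 judge and keyword filtering. The dataset is released at https://github.com/XenonLamb/EditVid-QA.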

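🔬 Method Details

Problem definition: Existing video large multimodal models (LMMs) perform poorly on the edited videos that are popular on social media. These videos contain effects, cuts, memes, and similar elements, which place higher demands on a model's understanding. Existing video understanding benchmarks focus on raw videos and lack a targeted evaluation of edited videos, so models generalize poorly in real-world applications.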

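Core idea: The paper builds EditVid-QA, a VQA benchmark dedicated to edited videos, and trains LMMs on a mixed dataset of raw and edited videos to improve their understanding of edited videos. This fills a gap in existing benchmarks and improves model performance in real-world applications.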

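Technical framework: The study consists of four stages: 1) Data collection: edited videos are collected from social media platforms such as TikTok and labeled by category. 2) Benchmark construction: the EditVid-QA benchmark is built with four editing categories: effect, funny, meme, and game. 3) Model training: LMMs are trained on a mixed dataset of Panda-70M/WebVid raw videos and a small number of TikTok/CapCut edited videos. 4) Model evaluation: model performance is evaluated on the EditVid-QA benchmark, using a GPT-4 judge and keyword filtering to avoid "sorry" attacks.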

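Key innovations: 1) The EditVid-QA benchmark, which fills the gap that existing video understanding benchmarks leave for edited videos. 2) The identification of the "sorry" attack that affects evaluation with GPT-3.5, together with a countermeasure based on a GPT-4 judge and keyword filtering. 3) Mixed training data that improves LMM performance on edited-video understanding.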
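The keyword-filtering step can be illustrated with a minimal sketch. The source does not list the actual keywords or say how filtered answers are scored, so the patterns in `REFUSAL_PATTERNS`, the `min_score` fallback, and the `judge` callable (a stand-in for a GPT-4 judge call) are all assumptions made for illustration.

```python
import re

# Hypothetical refusal markers; the paper's actual keyword list is not given here.
REFUSAL_PATTERNS = [
    r"\bsorry\b",
    r"\bi can(?:no|')t\b",
    r"\bunable to\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)


def is_sorry_style(answer: str) -> bool:
    """Return True if the answer contains any of the assumed refusal keywords."""
    return _REFUSAL_RE.search(answer) is not None


def score_answer(answer: str, judge, min_score: float = 0.0) -> float:
    """Score an answer, bypassing the judge for sorry-style answers.

    `judge` is a hypothetical callable that maps an answer to a numeric score
    (for example, a wrapper around a GPT-4 judge prompt).
    """
    if is_sorry_style(answer):
        # Assumed policy: filtered answers get the minimum score
        # instead of being rated by the LLM judge.
        return min_score
    return judge(answer)
```

Filtering before the judge call means a refusal can never pick up a high rating from a lenient judge, which is the failure mode the "sorry" attack describes.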

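Key design: The EditVid-QA benchmark covers four editing categories, each with its own VQA questions, to evaluate different aspects of model understanding. For training, a mixed dataset of raw and edited videos is used, and the training strategy is adjusted to balance the influence of the different video types. For evaluation, a GPT-4 judge and keyword filtering are used to keep the results accurate and reliable.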
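One way to picture the balancing of raw and edited videos is a fixed-ratio sampler, sketched below. The summary does not state the mixing ratio or the sampling scheme, so `edited_fraction`, `size`, and sampling with replacement are assumptions for illustration rather than the authors' procedure.

```python
import random


def mix_training_set(raw_samples, edited_samples,
                     edited_fraction=0.2, size=1000, seed=0):
    """Build a shuffled training list with a fixed share of edited samples.

    `edited_fraction` and `size` are hypothetical knobs; the source only says
    that the edited portion is small-scale.
    """
    rng = random.Random(seed)
    n_edited = round(size * edited_fraction)
    n_raw = size - n_edited
    # Sample with replacement so a small edited pool can still fill its quota.
    mixed = rng.choices(edited_samples, k=n_edited) + rng.choices(raw_samples, k=n_raw)
    rng.shuffle(mixed)
    return mixed
```

For example, `mix_training_set(raw, edited, edited_fraction=0.2, size=100)` returns 100 samples, 20 of which are drawn from the edited pool.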

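📊 Experimental Highlights

Existing open-source video LMMs perform poorly on the EditVid-QA benchmark, indicating a significant domain gap between edited and raw videos. Training on a mixed dataset of raw and edited videos improves LMM performance on the EditVid-QA benchmark, supporting the effectiveness of the approach.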

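🎯 Application Scenarios

The results can be applied to social media content understanding, intelligent video recommendation, and video content moderation. Better understanding of edited videos allows more accurate analysis of user interests, better recommendation quality, and more effective filtering of harmful content.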

📄 Abstract (Original)

The emerging video LMMs (Large Multimodal Models) have achieved significant improvements on generic video understanding in the form of VQA (Visual Question Answering), where the raw videos are captured by cameras. However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut and add effects/modifications to the raw video before publishing it on social media platforms. The edited videos usually have high view counts but they are not covered in existing benchmarks of video LMMs, i.e., ActivityNet-QA, or VideoChatGPT benchmark. In this paper, we leverage the edited videos on a popular short video platform, i.e., TikTok, and build a video VQA benchmark (named EditVid-QA) covering four typical editing categories, i.e., effect, funny, meme, and game. Funny and meme videos benchmark nuanced understanding and high-level reasoning, while effect and game evaluate the understanding capability of artificial design. Most of the open-source video LMMs perform poorly on the EditVid-QA benchmark, indicating a huge domain gap between edited short videos on social media and regular raw videos. To improve the generalization ability of LMMs, we collect a training set for the proposed benchmark based on both Panda-70M/WebVid raw videos and small-scale TikTok/CapCut edited videos, which boosts the performance on the proposed EditVid-QA benchmark, indicating the effectiveness of high-quality training data. We also identified a serious issue in the existing evaluation protocol using the GPT-3.5 judge, namely a "sorry" attack, where a sorry-style naive answer can achieve an extremely high rating from the GPT judge, e.g., over 4.3 for correctness score on VideoChatGPT evaluation protocol. To avoid the "sorry" attacks, we evaluate results with GPT-4 judge and keyword filtering. The dataset is released at https://github.com/XenonLamb/EditVid-QA.
