VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Authors: Ziyang Luo, Haoning Wu, Dongxu Li, Jing Ma, Mohan Kankanhalli, Junnan Li

Categories: cs.CV, cs.AI, cs.CL, cs.MM

Published: 2024-11-20 (updated: 2025-03-23)

Notes: CVPR 2025, Project Page: https://videoautoarena.github.io/

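💡 One-sentence takeaway

VideoAutoArena: an arena-style benchmark that automatically evaluates the video analysis abilities of large multimodal models through user simulation.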

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: video analysis, large model evaluation, user simulation, automated evaluation, ELO rating system

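📋 Key points

Existing evaluations of large multimodal models for video analysis rely on multiple-choice questions, which fail to capture the complex needs of real users, and human annotation is expensive.
VideoAutoArena generates open-ended questions through user simulation and ranks models with a modified ELO rating system, making evaluation automated and scalable.
Experiments show that VideoAutoArena effectively separates LMMs by performance and gives insight into their strengths and weaknesses; VideoAutoBench is introduced as an auxiliary benchmark.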

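🔬 Method details

Problem definition: Evaluations of large multimodal models (LMMs) for video analysis mainly rely on traditional formats such as multiple-choice questions. These formats cannot adequately simulate the complex needs of real users, so they give only a partial picture of a model's video understanding. Human annotation of video data is also costly and slow, which limits the scale and speed of evaluation.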

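Core idea: VideoAutoArena builds an automated, arena-style evaluation environment. User simulation generates open-ended, adaptive questions, and a modified ELO rating system compares LMMs fairly and continuously. The aim is to reflect more realistically how users interact with video analysis models while lowering the cost of evaluation.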
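As a point of reference for the rating step, here is a minimal sketch of the standard Elo update in Python. This summary does not describe how the paper modifies the system, so the code shows only the textbook formula; the function names, the K-factor of 32, and the 400-point scale are conventional defaults chosen for illustration, not values taken from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k is an assumed K-factor, not a value from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

With two models that both start at 1000, a win for A moves the pair to 1016 and 984, because the expected score for equally rated models is 0.5.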

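Technical framework: The overall framework consists of the following main modules: 1) User simulator: generates open-ended, adaptive video analysis questions. 2) Model response generation: the competing LMMs answer each question. 3) Automatic judging system: judges the competing responses automatically, and the battle outcomes feed a modified ELO rating system that ranks the models. 4) Fault-driven evolution strategy: progressively increases question complexity to push models toward more challenging scenarios. 5) VideoAutoBench: an auxiliary benchmark in which human annotators label the winners of a subset of VideoAutoArena battles, and GPT-4o judges model responses against those human-validated answers.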
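The first three modules can be pictured as a simple battle loop. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `run_arena`, the `Model`, `Asker`, and `Judge` callables, and the toy stand-ins are all hypothetical, and the rating step uses the standard Elo update instead of the paper's modified version.

```python
import random
from typing import Callable, Dict, List

# Hypothetical interfaces, not taken from the paper.
Model = Callable[[str, str], str]         # (video, question) -> answer
Asker = Callable[[str], str]              # video -> simulated user question
Judge = Callable[[str, str, str], float]  # (question, answer_a, answer_b) -> score for A


def run_arena(videos: List[str], models: Dict[str, Model], ask: Asker,
              judge: Judge, k: float = 32.0, seed: int = 0) -> Dict[str, float]:
    """Run one battle per video and return ratings from a standard Elo update."""
    rng = random.Random(seed)
    ratings = {name: 1000.0 for name in models}
    for video in videos:
        question = ask(video)                           # 1) user simulation
        name_a, name_b = rng.sample(sorted(models), 2)  # pick two contenders
        answer_a = models[name_a](video, question)      # 2) model responses
        answer_b = models[name_b](video, question)
        score_a = judge(question, answer_a, answer_b)   # 3) automatic judging
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[name_b] - ratings[name_a]) / 400.0))
        delta = k * (score_a - expected_a)
        ratings[name_a] += delta
        ratings[name_b] -= delta
    return ratings


# Toy stand-ins so the sketch runs end to end.
toy_models = {
    "detailed": lambda video, q: f"{video}: a long, step-by-step answer",
    "terse": lambda video, q: "yes",
}
toy_ask = lambda video: f"What happens in {video}?"
toy_judge = lambda q, a, b: 1.0 if len(a) > len(b) else 0.0  # prefers longer answers

ratings = run_arena(["clip1", "clip2", "clip3"], toy_models, toy_ask, toy_judge)
```

In a real setup the three callables would wrap LMM calls; the length-based judge here exists only to make the example deterministic.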

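Key innovations: The main contributions are the automated, scalable evaluation framework and the use of user simulation together with a fault-driven evolution strategy. Compared with multiple-choice evaluation, VideoAutoArena simulates user interaction with video analysis models more realistically and assesses video understanding more comprehensively. The fault-driven evolution strategy raises question difficulty step by step, which helps expose model weaknesses.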
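One way to read "fault-driven evolution" is as a loop that inspects model answers for faults and rewrites the question to target them. The sketch below follows that reading only; this summary does not specify the paper's procedure, so `evolve_question`, its callables, the stopping rule, and `max_rounds` are all assumptions made for illustration.

```python
from typing import Callable, List


def evolve_question(
    question: str,
    get_answers: Callable[[str], List[str]],            # question -> model answers
    find_faults: Callable[[str, List[str]], List[str]],  # -> faults seen in the answers
    rewrite: Callable[[str, List[str]], str],            # -> harder question
    max_rounds: int = 3,
) -> List[str]:
    """Return the chain of questions, from the seed to the most evolved one."""
    history = [question]
    for _ in range(max_rounds):
        faults = find_faults(question, get_answers(question))
        if not faults:
            break  # assumed stopping rule: nothing left to target
        question = rewrite(question, faults)
        history.append(question)
    return history


# Toy stand-ins: the only "fault" is an answer that never mentions timing.
toy_answers = lambda q: ["A person opens a door."]
toy_faults = lambda q, answers: [] if "timestamp" in q else ["ignores timing"]
toy_rewrite = lambda q, faults: q + " Cite timestamps for each event."

chain = evolve_question("What happens in the clip?", toy_answers, toy_faults, toy_rewrite)
```

In practice the fault finder and the rewriter would themselves be model calls; the string checks above only keep the example self-contained.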

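Key design: The main design considerations are: 1) the user simulator must generate diverse, challenging questions; 2) the modified ELO rating system must keep comparisons fair and accurate; 3) the fault-driven evolution strategy must increase question complexity in a reasonable way; 4) VideoAutoBench requires representative videos and questions with high-quality human annotation.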

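📊 Experimental highlights

The automatic judging system is validated against a "gold standard" built from a carefully curated subset of human annotations, and the arena aligns strongly with human judgment while remaining scalable. The experiments show that VideoAutoArena effectively differentiates among LMMs and gives insight into model strengths and areas for improvement. VideoAutoBench serves as an auxiliary benchmark that further streamlines evaluation.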
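Alignment between an automatic judge and human annotators is often summarized as a simple agreement rate over the same set of battles. The sketch below shows that calculation under assumptions: the function name and the label vocabulary ("A", "B", "tie") are hypothetical, and the sample labels are toy data, not results from the paper.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of battles where the automatic judge picks the human-chosen winner."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled battle is required")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Toy labels for four battles; the judge disagrees with the human on the third.
rate = agreement_rate(["A", "B", "tie", "A"], ["A", "B", "A", "A"])
```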

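🎯 Application scenarios

VideoAutoArena could be applied in areas such as video surveillance, autonomous driving, smart homes, and content moderation, helping developers quickly evaluate and optimize large video analysis models and improve the user experience. The work lays groundwork for building more capable and more reliable video analysis systems.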

📄 Abstract (original)

Large multimodal models (LMMs) with advanced video analysis capabilities have recently garnered significant attention. However, most evaluations rely on traditional methods like multiple-choice questions in benchmarks such as VideoMME and LongVideoBench, which tend to lack the depth needed to capture the complex demands of real-world users. To address this limitation, and given the prohibitive cost and slow pace of human annotation for video tasks, we introduce VideoAutoArena, an arena-style benchmark inspired by LMSYS Chatbot Arena's framework, designed to automatically assess LMMs' video analysis abilities. VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding. The benchmark features an automated, scalable evaluation framework, incorporating a modified ELO Rating System for fair and continuous comparisons across multiple LMMs. To validate our automated judging system, we construct a 'gold standard' using a carefully curated subset of human annotations, demonstrating that our arena strongly aligns with human judgment while maintaining scalability. Additionally, we introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios. Experimental results demonstrate that VideoAutoArena effectively differentiates among state-of-the-art LMMs, providing insights into model strengths and areas for improvement. To further streamline our evaluation, we introduce VideoAutoBench as an auxiliary benchmark, where human annotators label winners in a subset of VideoAutoArena battles. We use GPT-4o as a judge to compare responses against these human-validated answers. Together, VideoAutoArena and VideoAutoBench offer a cost-effective and scalable framework for evaluating LMMs in user-centric video analysis.
