Judge Anything: MLLM as a Judge Across Any Modality

Authors: Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu

Categories: cs.CL, cs.CV

Published: 2025-03-21

🔗 Code/Project: PROJECT_PAGE

💡 One-Sentence Summary

Introduces the TaskAnything and JudgeAnything benchmarks to evaluate MLLMs on cross-modal understanding and generation tasks.

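🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, large language models, automated evaluation, cross-modal understanding, cross-modal generation, benchmarking, model evaluation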

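📋 Key Points

Existing multimodal evaluation methods struggle with complex cross-modal interactions and lack unified evaluation standards and tools.
The paper introduces the TaskAnything and JudgeAnything benchmarks, which use MLLMs as automated judges to evaluate cross-modal understanding and generation in a unified way.
Experiments show that MLLMs are promising judges of MMU tasks but suffer from biases and hallucinations when judging MMG tasks.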

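📝 Abstract (translated from Chinese)

Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks is highly challenging because of the complexity of cross-modal interactions. The idea of using multimodal LLMs (MLLMs) as automated judges has emerged in response, with encouraging results on vision-language understanding tasks. This paper extends MLLM-as-a-Judge across modalities by introducing two benchmarks, TaskAnything and JudgeAnything, which respectively evaluate, in a unified manner, the overall performance and the judging capabilities of MLLMs on any-to-any modality tasks. TaskAnything evaluates MMU and MMG capabilities across 15 any-to-any modality categories, using 1,500 queries drawn from well-established benchmarks. JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (such as GPT-4o and Gemini-2.0-Flash) under Pair Comparison and Score Evaluation, providing a standardized testbed that includes human judgments and detailed rubrics. Experiments show that MLLMs are promising judges of MMU tasks (averaging 66.55% in Pair Comparison and 42.79% in Score Evaluation) but struggle considerably with MMG tasks (averaging 53.37% in Pair Comparison and 30.05% in Score Evaluation), exposing cross-modality biases and hallucination issues. To address this, the authors present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. The work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available.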

🔬 Method Details

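Problem definition: The paper addresses how to evaluate multimodal large language models (MLLMs) on multimodal understanding (MMU) and generation (MMG) tasks. Existing evaluation methods usually target a specific modality or task, so they lack generality and comparability and cannot assess an MLLM's cross-modal capabilities comprehensively. Human evaluation is also expensive and hard to scale.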

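Core idea: Use MLLMs themselves as automated judges. With suitable benchmarks and evaluation protocols, an MLLM evaluates the outputs of other MLLMs or models. This lowers evaluation cost, improves evaluation efficiency, and allows a more comprehensive assessment of cross-modal capabilities.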
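The judge-as-evaluator idea can be sketched for the pair-comparison case as below. This is a minimal illustration, not the paper's protocol: the prompt template, the `[[A]]`/`[[B]]`/`[[tie]]` verdict format, and every function name are hypothetical, and `stub_judge` stands in for a real MLLM call. It also handles text only, whereas a real any-to-any judge would receive images, audio, or video as attachments.

```python
import re
from typing import Callable, Optional

# Hypothetical prompt template; the paper's actual prompts are not given in this summary.
PAIR_TEMPLATE = (
    "You are judging two responses to the same task.\n"
    "Task: {task}\n"
    "Rubric: {rubric}\n"
    "Response A: {a}\n"
    "Response B: {b}\n"
    "Answer with exactly one of [[A]], [[B]] or [[tie]]."
)


def build_pair_prompt(task: str, rubric: str, a: str, b: str) -> str:
    """Fill the template with the task, the rubric, and the two candidate responses."""
    return PAIR_TEMPLATE.format(task=task, rubric=rubric, a=a, b=b)


def parse_verdict(reply: str) -> Optional[str]:
    """Extract 'A', 'B' or 'tie' from the judge's reply; None if no verdict is found."""
    match = re.search(r"\[\[(A|B|tie)\]\]", reply)
    return match.group(1) if match else None


def judge_pair(judge: Callable[[str], str], task: str, rubric: str, a: str, b: str) -> Optional[str]:
    """Ask the judge model to compare two responses and return its parsed verdict."""
    return parse_verdict(judge(build_pair_prompt(task, rubric, a, b)))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real MLLM call; it always prefers response A."""
    return "Response A follows the rubric more closely. [[A]]"


verdict = judge_pair(
    stub_judge,
    task="Describe the image.",
    rubric="Accuracy and level of detail.",
    a="A cat on a sofa.",
    b="An animal.",
)
print(verdict)
```

Passing the judge as a plain callable keeps the sketch independent of any particular model API, so a real client could be swapped in for `stub_judge` without touching the prompt or parsing code.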

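Technical framework: The paper proposes two benchmarks, TaskAnything and JudgeAnything. TaskAnything is a multimodal task benchmark covering MMU and MMG tasks in 15 any-to-any modality categories, used to evaluate the overall performance of MLLMs. JudgeAnything is a judging benchmark that evaluates how well MLLMs act as judges, in two settings: Pair Comparison and Score Evaluation. The paper also presents OmniArena, an automated platform for evaluating omni-models and multimodal reward models.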

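Key innovation: The paper treats MLLMs as automated judges and designs the TaskAnything and JudgeAnything benchmarks to evaluate MLLMs' cross-modal capabilities in a unified way. This lowers evaluation cost, improves evaluation efficiency, and offers a new evaluation perspective for the development of multimodal models.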

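Key design: The key design choices in the JudgeAnything benchmark are: 1) using both Pair Comparison and Score Evaluation to assess MLLMs' judging capabilities comprehensively; 2) using human judgments as ground truth to measure the accuracy of the MLLMs' judgments; 3) providing detailed rubrics that guide the MLLMs' judgments and make the results more interpretable.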
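Measuring a judge against human ground truth in both settings could look like the sketch below. It assumes the reported percentages are simple exact-match agreement rates, which this summary does not state explicitly, and the toy labels are made up for illustration rather than taken from the benchmark.

```python
from typing import Hashable, Sequence


def agreement_rate(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Fraction of items on which the judge's output equals the human label."""
    if len(judge) != len(human):
        raise ValueError("judge and human must have the same length")
    if not judge:
        raise ValueError("need at least one item")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical toy data, not from the paper.
pair_judge = ["A", "B", "tie", "A"]   # judge's preferred response per item
pair_human = ["A", "B", "A", "B"]     # human-preferred response per item
score_judge = [5, 3, 4, 2]            # judge's rubric scores
score_human = [5, 4, 4, 1]            # human rubric scores

pair_acc = agreement_rate(pair_judge, pair_human)     # 2 of 4 verdicts match
score_acc = agreement_rate(score_judge, score_human)  # 2 of 4 scores match exactly
print(f"pair comparison: {pair_acc:.2%}, score evaluation: {score_acc:.2%}")
```

Exact-match scoring is stricter than picking the better of two responses, which is one plausible reason Score Evaluation numbers would sit below Pair Comparison numbers.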

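📊 Experimental Highlights

The experiments show that MLLMs are promising judges of MMU tasks (averaging 66.55% in Pair Comparison and 42.79% in Score Evaluation) but struggle considerably with MMG tasks (averaging 53.37% in Pair Comparison and 30.05% in Score Evaluation), exposing cross-modality biases and hallucination issues. This indicates that current MLLMs still have substantial room for improvement as judges of cross-modal generation.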
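The size of the MMU-to-MMG drop follows directly from the four reported averages. The short calculation below uses only those figures; the dictionary layout and the `gap` helper are illustrative.

```python
# Average judge performance (%), as reported in this summary.
reported = {
    "MMU": {"pair_comparison": 66.55, "score_evaluation": 42.79},
    "MMG": {"pair_comparison": 53.37, "score_evaluation": 30.05},
}


def gap(setting: str) -> float:
    """Percentage-point drop from MMU to MMG for one judging setting."""
    return round(reported["MMU"][setting] - reported["MMG"][setting], 2)


pair_gap = gap("pair_comparison")    # 66.55 - 53.37
score_gap = gap("score_evaluation")  # 42.79 - 30.05
print(f"Pair Comparison gap: {pair_gap} points")
print(f"Score Evaluation gap: {score_gap} points")
```

Both settings lose roughly 13 percentage points when moving from understanding to generation tasks (13.18 and 12.74 respectively).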

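🎯 Application Scenarios

The results can be applied to the development and evaluation of multimodal large models, for example to evaluate different models automatically and to guide training and optimization. The approach could also support building multimodal intelligent systems, such as intelligent assistants and customer-service agents, making those systems more capable.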

📄 Abstract (Original)

Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities to a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
