IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation
Authors: Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai
Categories: cs.AI, cs.CV, cs.MM
Published: 2026-06-08
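💡 One-Sentence Summary
Proposes IMUG-Bench to address the evaluation of multi-turn interleaved image-text dialogue.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: unified multimodal models, multi-turn dialogue, image-text understanding, generation tasks, exposure bias, evaluation benchmark, dynamic understanding, intelligent assistants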
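📋 Key Points
- Existing benchmarks are mostly limited to single-turn or static settings, so they do not evaluate understanding and generation in multi-turn image-text dialogue, and they typically overlook exposure bias.
- IMUG-Bench is a benchmark of multi-turn interaction scenarios that jointly evaluates the understanding and generation capabilities of UMMs.
- Large-scale experiments reveal the capability boundaries of UMMs, and several test-time scaling strategies are explored to improve generation accuracy and mitigate exposure bias.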
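📝 Abstract (translated from the Chinese summary)
In recent years, unified multimodal models (UMMs), which support both understanding and generation within a single framework, have attracted growing attention. Existing benchmarks, however, are often limited to single-turn or static settings and do not evaluate understanding and generation in multi-turn interaction. This paper therefore proposes IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue that covers 3,113 samples and 12,034 interaction turns and includes dynamic understanding questions. Systematic experiments on mainstream UMMs reveal their capability boundaries and failure modes. The paper also explores several test-time scaling strategies that improve generation accuracy and mitigate exposure bias in generation tasks. These findings offer insights for improving the robustness and multi-turn interaction capability of future UMMs.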
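🔬 Method Details
Problem definition: Existing benchmarks cannot effectively evaluate understanding and generation in multi-turn image-text dialogue, and in particular they do not account for exposure bias in multi-turn interaction.
Core idea: IMUG-Bench uses dynamic understanding questions and multi-turn interaction samples to evaluate UMMs comprehensively, in a way that better reflects real-world interaction.
Technical framework: IMUG-Bench comprises three classes, Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns (roughly 3.9 turns per sample on average), and it jointly evaluates understanding and generation.
Key innovation: The benchmark is both comprehensive and dynamic. It evaluates how UMMs perform across multi-turn interaction and analyzes exposure bias on the generation side in depth.
Key design: The experiments apply several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks.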
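As an illustration of one of the strategies named above, Best-of-N Sampling can be sketched in a few lines: draw several candidates for the same turn, score each one, and keep the best. This is a minimal sketch and not the paper's implementation. The `generate` and `score` callables, the toy stand-ins, and the default `n` are all hypothetical; the summary does not say which scorer or sample count the authors use.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    prompt: str,
    n: int = 4,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidates for one prompt and return the highest-scoring one.

    `generate` and `score` are hypothetical stand-ins for a model call and a
    verifier or reward function; neither is specified in the source summary.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    # Draw n independent candidates for the same prompt.
    candidates = [generate(prompt, rng) for _ in range(n)]
    # Score every candidate and keep the one with the highest score.
    scored = [(score(prompt, candidate), candidate) for candidate in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins so the sketch runs without any model.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return f"{prompt} -> candidate {rng.randint(0, 99)}"


def toy_score(prompt: str, candidate: str) -> float:
    # Pretend the trailing number is a quality score.
    return float(candidate.rsplit(" ", 1)[-1])


if __name__ == "__main__":
    print(best_of_n(toy_generate, toy_score, "draw a red cube", n=5))
```

The selection step is only as good as the scorer, so in practice the choice of `score` matters more than the value of `n`.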
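📊 Experimental Highlights
The experiments show clear capability boundaries and failure modes for mainstream UMMs on IMUG-Bench, with pronounced exposure bias on the generation side. Test-time scaling strategies such as Chain-of-Thought improve generation accuracy and mitigate exposure bias in multi-turn interaction; this summary does not report the size of the improvement.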
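Exposure bias is commonly understood as the gap between a model's performance when it is conditioned on reference history and its performance when it is conditioned on its own earlier outputs. One way to make that gap visible per turn is sketched below. This is only an illustration under that assumption: the record layout and the function name are hypothetical, and the summary does not describe the paper's actual measurement protocol.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (turn_index, correct_with_reference_history,
#                       correct_with_own_history)
Record = Tuple[int, bool, bool]


def per_turn_gap(records: Iterable[Record]) -> Dict[int, float]:
    """Return, for each turn, accuracy with reference history minus
    accuracy with the model's own generated history."""
    # Per turn: [number of records, correct with reference, correct with own]
    totals = defaultdict(lambda: [0, 0, 0])
    for turn, ref_ok, own_ok in records:
        entry = totals[turn]
        entry[0] += 1
        entry[1] += int(ref_ok)
        entry[2] += int(own_ok)
    return {
        turn: (ref - own) / count
        for turn, (count, ref, own) in sorted(totals.items())
    }


if __name__ == "__main__":
    demo = [
        (1, True, True),
        (1, True, True),
        (2, True, False),
        (2, True, True),
    ]
    print(per_turn_gap(demo))
```

A gap that grows with the turn index would be consistent with errors accumulating across turns, which is the pattern the term exposure bias usually describes.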
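🎯 Application Scenarios
IMUG-Bench provides a new evaluation standard for research on multimodal dialogue systems. It can be applied in areas such as intelligent customer service and virtual assistants, where it can help improve system performance in complex interaction scenarios and support the practical adoption of multimodal technology.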
📄 Abstract (original)
In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.