Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

📄 arXiv: 2606.05531v1 📥 PDF

Authors: Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

Categories: cs.CV, cs.AI, cs.CL, cs.LG

Published: 2026-06-04

Comments: Accepted to ACL 2026 Findings

🔗 Code/Project: https://github.com/qcri/Almieyar-Oryx-BloomBench


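💡 One-Sentence Summary

Introduces BloomBench to address the lack of cognitively grounded evaluation for multimodal models.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: vision-language models, multimodal evaluation, cognitive science, bilingual benchmark, Bloom's Taxonomy, Arabic, English, artificial intelligence

📋 Key Points

  1. Existing evaluations of multimodal models are unsystematic and fail to fully expose models' cognitive abilities and latent weaknesses.
  2. The paper proposes the BloomBench benchmark, which uses Bloom's Taxonomy to systematically evaluate the cognitive abilities of vision-language models.
  3. Current models perform well on semantic understanding but fall clearly short on factual recall and creative synthesis, and there is a marked performance gap between Arabic and English.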

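📝 Abstract (Summary)

Despite the rapid progress of vision-language models (VLMs), the field lacks benchmarks that rigorously diagnose their reasoning abilities; existing evaluations tend to focus on disconnected tasks and so fail to reveal cognitive weaknesses. To address this, the paper introduces BloomBench, described as the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark, which systematically evaluates six levels of cognition. A semi-automated pipeline and a stratified hybrid quality assurance protocol are used to ensure scalability and cultural inclusivity. The study shows that current models perform well on semantic understanding but fall substantially short on factual recall and creative synthesis, and it exposes a performance gap between Arabic and English, laying a foundation for developing more cognitively aligned and inclusive VLMs.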

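🔬 Method Details

Problem definition: The paper targets the lack of systematic, in-depth evaluation of multimodal models. Existing methods often fail to reveal what models can and cannot actually do at the cognitive level.

Core idea: The paper proposes the BloomBench benchmark, which is grounded in Bloom's Taxonomy and defines tasks at six cognitive levels to evaluate the cognitive abilities of vision-language models comprehensively.

Technical framework: BloomBench consists of three main modules: task design, data collection, and quality assurance. Task design covers the six levels Remember, Understand, Apply, Analyze, Evaluate, and Create, so that the evaluation is comprehensive and systematic.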
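As a rough illustration of how the six levels might be attached to bilingual image-question-answer items, the following Python sketch models a single record. The `BenchItem` class and all of its field names are hypothetical and are not taken from the released dataset schema; the example question and its level label are invented for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class BloomLevel(Enum):
    """The six cognitive levels named in the paper, ordered low to high."""
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


@dataclass(frozen=True)
class BenchItem:
    """Hypothetical record layout; the real dataset schema may differ."""
    image_path: str
    level: BloomLevel
    question_en: str
    question_ar: str
    answer_en: str
    answer_ar: str

    def prompt(self, lang: str) -> str:
        """Return the question text for 'en' or 'ar'."""
        if lang == "en":
            return self.question_en
        if lang == "ar":
            return self.question_ar
        raise ValueError(f"unsupported language: {lang!r}")


# Invented example item; the level label here is illustrative, not authoritative.
item = BenchItem(
    image_path="images/example.jpg",  # placeholder path
    level=BloomLevel.UNDERSTAND,
    question_en="How many people are in the image?",
    question_ar="كم عدد الأشخاص في الصورة؟",
    answer_en="Three",
    answer_ar="ثلاثة",
)
print(item.level.name, item.prompt("en"))
```

Keeping both language versions in one record is only one possible design; it makes it easy to compare a model's answers to the same image and question across English and Arabic.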

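Key innovation: The main contribution is combining cognitive science with multimodal evaluation, yielding what the authors describe as the first cognitively grounded bilingual (English-Arabic) multimodal benchmark and filling a gap in existing evaluations.

Key design: The benchmark is built with a semi-automated pipeline and validated with a stratified hybrid quality assurance protocol that safeguards data quality and diversity, supporting scalability and cultural inclusivity.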
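This summary does not describe the protocol's details. One common reading of "stratified" is that items are grouped by some attribute and a fixed number are drawn from each group for human review. The sketch below shows that idea under the assumption that the strata are (language, level) pairs; the function name, the dictionary keys, and the choice of strata are all hypothetical, and this is not the authors' procedure.

```python
import random
from collections import defaultdict


def stratified_review_sample(items, per_stratum, seed=0):
    """Pick up to `per_stratum` items from each (language, level) stratum.

    `items` is a list of dicts with hypothetical keys 'id', 'lang', 'level'.
    """
    rng = random.Random(seed)  # fixed seed so the review set is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[(item["lang"], item["level"])].append(item)

    sample = []
    for key in sorted(strata):  # sorted keys give a deterministic order
        group = strata[key]
        k = min(per_stratum, len(group))  # small strata are taken in full
        sample.extend(rng.sample(group, k))
    return sample


# Toy pool: 5 invented items in each of 4 strata.
items = []
for lang in ("en", "ar"):
    for level in ("Remember", "Create"):
        for _ in range(5):
            items.append({"id": len(items), "lang": lang, "level": level})

review = stratified_review_sample(items, per_stratum=2)
print(len(review), "items selected for human review")
```

Sampling per stratum, rather than uniformly over the whole pool, keeps rare combinations such as high-level Arabic items from being skipped by the reviewers.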

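📊 Experimental Highlights

The experiments show that current state-of-the-art vision-language models reach high performance on semantic understanding but fall substantially short on factual recall and creative synthesis, and that performance on Arabic tasks lags noticeably behind English. These findings give concrete direction for improving future models.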
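A per-level, per-language breakdown of this kind can be sketched as follows, assuming each model response has already been judged correct or incorrect. The record format and both function names are hypothetical, the paper's actual scoring method is not described in this summary, and the toy records are invented to exercise the code rather than to reflect reported results.

```python
from collections import defaultdict

LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def accuracy_by_level(records):
    """Map (language, level) -> fraction of correct answers.

    Each record is a hypothetical (language, level, is_correct) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, level, is_correct in records:
        total[(lang, level)] += 1
        correct[(lang, level)] += int(is_correct)
    return {key: correct[key] / total[key] for key in total}


def language_gap(acc, level, high="en", low="ar"):
    """English-minus-Arabic accuracy for one level, or None if data is missing."""
    if (high, level) not in acc or (low, level) not in acc:
        return None
    return acc[(high, level)] - acc[(low, level)]


# Toy records, invented purely for illustration (not paper results).
toy = [
    ("en", "Understand", True), ("en", "Understand", True),
    ("ar", "Understand", True), ("ar", "Understand", False),
    ("en", "Create", False), ("en", "Create", True),
    ("ar", "Create", False), ("ar", "Create", False),
]
acc = accuracy_by_level(toy)
for level in LEVELS:
    gap = language_gap(acc, level)
    if gap is not None:
        print(f"{level}: en-ar gap = {gap:.2f}")
```

Reporting the gap per level, instead of a single overall score, is what lets an asymmetry across cognitive levels and languages show up at all.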

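🎯 Application Scenarios

Potential application areas include education, AI assistants, and cross-cultural communication. By providing a more cognitively aligned evaluation benchmark, the work can help in developing more capable vision-language models and improving their performance and user experience in real-world use.

📄 Abstract (Original)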

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset is available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.