GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

Authors: Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, Ying Huang

Categories: cs.CV

Published: 2026-03-16

Note: accepted by CVPR 2026

💡 One-Sentence Takeaway

Proposes GUI-CEval to address the lack of evaluation benchmarks for Chinese mobile GUI agents.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: Chinese mobile GUI, multimodal large language models, capability evaluation, user interaction, benchmark

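📋 Key Points

Existing benchmarks are largely English-centric, cannot effectively evaluate Chinese mobile GUI agents, and lack a comprehensive capability evaluation framework.
The paper proposes the GUI-CEval benchmark, which covers 201 mainstream apps across four device types and uses a two-level structure to evaluate agents along multiple capability dimensions.
Experiments show that although some models perform competitively, most MLLMs remain clearly weak at reflective decision-making and self-evaluation, which limits their usefulness in practice.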

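📝 Abstract (translated from Chinese)

Advances in Multimodal Large Language Models (MLLMs) have given mobile GUI agents the abilities of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric, fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem, and lack a unified framework for assessing the full capability chain from perception to execution. To address this, the paper proposes GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents. It covers 201 mainstream apps and uses a two-level structure to evaluate five capability dimensions: perception, planning, reflection, execution, and evaluation. The data are collected and verified through a multi-stage manual process to ensure authenticity and reproducibility. Experiments show that although some models perform well, most MLLMs still have clear weaknesses in reflective decision-making and post-action self-evaluation, which limits their reliability in real-world interactions.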

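🔬 Method Details

Problem definition: Existing evaluations of mobile GUI agents are mostly English-centric. There is no comprehensive evaluation framework for the Chinese environment, so the interaction characteristics and needs of Chinese users are not captured.

Core idea: The paper builds the GUI-CEval benchmark, a comprehensive evaluation suite spanning multiple device types and apps, to assess the full capability chain of Chinese mobile GUI agents.

Technical framework: GUI-CEval uses a two-level structure. It first evaluates an agent's atomic abilities, then evaluates overall performance in realistic application scenarios. Both levels are organized around five dimensions: perception, planning, reflection, execution, and evaluation.

Key innovation: GUI-CEval is the first comprehensive benchmark designed specifically for Chinese mobile GUI agents. It fills a gap left by existing evaluation methods and supports finer-grained capability assessment.

Key design: The data are collected and verified through a multi-stage manual process to ensure authenticity and reproducibility, and the evaluation metrics are designed to cover a variety of real application scenarios.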
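
The two-level, five-dimension structure described above can be pictured as a grid of (level, dimension) cells, each with its own score. The following is a minimal sketch of that bookkeeping, not the paper's scoring code: the `Result` record, the level labels `"atomic"` and `"application"`, and the use of plain accuracy as the metric are all assumptions made for illustration. Only the five dimension names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five dimensions named in the abstract.
DIMENSIONS = ("perception", "planning", "reflection", "execution", "evaluation")
# Hypothetical labels for the two levels (atomic abilities vs. application-level tasks).
LEVELS = ("atomic", "application")


@dataclass(frozen=True)
class Result:
    """One scored benchmark item. Field names are illustrative assumptions."""
    level: str
    dimension: str
    correct: bool


def accuracy_by_cell(results):
    """Return {(level, dimension): accuracy} over the items seen in each cell."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        if r.level not in LEVELS or r.dimension not in DIMENSIONS:
            raise ValueError(f"unknown cell: {r.level}/{r.dimension}")
        key = (r.level, r.dimension)
        totals[key] += 1
        hits[key] += int(r.correct)
    return {key: hits[key] / totals[key] for key in totals}


def weakest_dimensions(cell_acc, level, k=2):
    """Return the k lowest-scoring dimensions at one level, lowest first."""
    scored = [(acc, dim) for (lvl, dim), acc in cell_acc.items() if lvl == level]
    return [dim for acc, dim in sorted(scored)[:k]]
```

Keeping the two levels as separate cells, rather than averaging them into one number per dimension, is what would let a report say that a model does well on atomic perception yet poorly on application-level reflection.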

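📊 Experimental Highlights

The experiments show that although models such as Qwen2.5-VL and UI-TARS perform well on some tasks, most MLLMs still have clear weaknesses in reflective decision-making and post-action self-evaluation, which undermines their reliability in real interactions. This finding points to a concrete direction for future model improvement.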

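🎯 Application Scenarios

Potential applications include mobile app development, user-experience optimization, and the evaluation of AI assistants. By providing a comprehensive benchmark, GUI-CEval can help developers understand and improve the capabilities of Chinese mobile GUI agents and advance the related technology. Going forward, the benchmark could become an industry standard and support the development of Chinese-language AI.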

📄 Abstract (original)

Recent progress in Multimodal Large Language Models (MLLMs) has enabled mobile GUI agents capable of visual perception, cross-modal reasoning, and interactive control. However, existing benchmarks are largely English-centric and fail to capture the linguistic and interaction characteristics of the Chinese mobile ecosystem. They also focus on isolated skills such as GUI grounding or offline agent, lacking a unified and fine-grained framework to assess the full capability chain from perception to execution. To address this gap, we introduce GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, built entirely on physical device environments. GUI-CEval spans 201 mainstream apps across four device types and adopts a two-level structure that evaluates both atomic abilities and realistic application-level performance along five dimensions: perception, planning, reflection, execution, and evaluation. All data are collected and verified through multi-stage manual processes to ensure authenticity and reproducibility. Extensive experiments on 20 representative MLLMs and multi-agent systems show that while models such as Qwen2.5-VL and UI-TARS perform competitively, most MLLMs still exhibit clear weaknesses in reflective decision-making and post-action self-evaluation, limiting their reliability in real-world interactions. We hope GUI-CEval provides a comprehensive and interpretable benchmark to guide capability diagnosis and advance the development of Chinese mobile GUI agents.
