Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Authors: Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui

Category: cs.AI

Published: 2024-09-12 (updated: 2024-09-13)

🔗 Code/Project: GITHUB | PROJECT_PAGE

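💡 One-Sentence Summary

Proposes Windows Agent Arena, an environment for evaluating multi-modal operating-system agents at scale.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: OS agents, multi-modal learning, reinforcement learning, agent evaluation, Windows operating system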

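📋 Key Points

Existing agent benchmarks are usually limited to specific modalities or domains, and a full evaluation takes too long, so they reflect real-world use poorly.
Proposes Windows Agent Arena, a general evaluation environment built on a real Windows operating system in which agents can operate freely and use a wide range of tools.
Builds a benchmark of 150+ tasks and designs Navi, a multi-modal agent that reaches a 19.5% success rate on the Windows tasks and also performs strongly on the web benchmark Mind2Web.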

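📝 Abstract (translated)

Large language models (LLMs) show great potential as computer agents that can improve human productivity and software accessibility in multi-modal tasks requiring planning and reasoning. Measuring agent performance in realistic environments remains difficult, however, because (i) most benchmarks are restricted to specific modalities or domains (for example text-only, web navigation, question answering, or coding), and (ii) full benchmark evaluations are very slow (on the order of days) given the multi-step, sequential nature of the tasks. To address these challenges, we introduce Windows Agent Arena: a reproducible, general environment focused exclusively on the Windows operating system (OS), in which agents operate freely inside a real Windows OS and solve tasks with the same applications, tools, and web browsers available to human users. We adapt the OSWorld framework (Xie et al., 2024) to create more than 150 diverse Windows tasks across representative domains that require planning, screen understanding, and tool use. The benchmark is scalable and can be parallelized seamlessly in Azure, so a full evaluation completes in as little as 20 minutes. To demonstrate the capabilities of Windows Agent Arena, we also introduce a new multi-modal agent, Navi. Our agent achieves a 19.5% success rate in the Windows domain, compared with 74.5% for an unassisted human. Navi also performs strongly on Mind2Web, another popular web-based benchmark. We provide extensive quantitative and qualitative analysis of Navi's performance and discuss opportunities for future research on agent development and data generation with Windows Agent Arena.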

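🔬 Method Details

Problem definition: Existing agent evaluation methods are limited in two ways. First, the evaluation environment is narrow: it typically covers only text, web pages, or a specific application, and does not cover a general operating-system environment. Second, evaluation is slow: because tasks are multi-step and sequential, a full evaluation takes a long time (on the order of days). Both limits slow down agent development and iteration.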

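Core idea: The paper builds a scalable, reproducible, general operating-system environment for evaluating agents in realistic settings. Agents run inside a real Windows OS and complete tasks using the applications, tools, and web browsers available there. This gives a more complete picture of an agent's planning, screen-understanding, and tool-use abilities.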

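Technical framework: Windows Agent Arena is adapted from the OSWorld framework and consists of the following modules: 1) Windows OS environment: the platform the agent acts in, with a range of applications and tools; 2) task definition module: defines more than 150 Windows tasks across representative domains; 3) evaluation module: measures agent performance on each task, with metrics such as success rate and time; 4) multi-modal agent Navi: an example agent that demonstrates use of Windows Agent Arena. The whole framework can run in parallel on Azure, which shortens evaluation time.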
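The parallelization idea can be illustrated with a minimal Python sketch. It is not the Windows Agent Arena API: `Task`, `run_task`, and `evaluate_parallel` are hypothetical names, and `run_task` is a stub standing in for launching an agent in a Windows VM and checking the outcome. The sketch only shows that independent tasks can be spread over workers and the success rate aggregated afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not the benchmark's real schema."""
    task_id: str
    instruction: str


def run_task(task: Task) -> bool:
    """Stub runner (assumption): a real one would start a VM, run the agent,
    and check the result. Here a task "succeeds" if its instruction mentions
    'open', purely so the example is deterministic.
    """
    return "open" in task.instruction.lower()


def evaluate_parallel(tasks: list[Task], n_workers: int) -> float:
    """Run independent tasks on a worker pool and return the success rate."""
    if not tasks:
        raise ValueError("no tasks to evaluate")
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        outcomes = list(pool.map(run_task, tasks))
    return sum(outcomes) / len(outcomes)


# Made-up tasks for illustration only.
tasks = [
    Task("t1", "Open Notepad and type hello"),
    Task("t2", "Change the desktop wallpaper"),
    Task("t3", "Open the Downloads folder"),
    Task("t4", "Mute the system volume"),
]
success_rate = evaluate_parallel(tasks, n_workers=2)
print(f"success rate: {success_rate:.1%}")
```

Because each task is evaluated independently, wall-clock time in a setup like this is bounded by the slowest shard rather than by the sum of all tasks, which is the property the paper exploits when it parallelizes the benchmark in Azure.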

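Key innovation: The main contribution is a scalable, reproducible, general operating-system evaluation environment. Compared with existing evaluation methods, Windows Agent Arena is closer to real usage and evaluates agent performance more comprehensively. The framework also supports parallel execution, which cuts a full evaluation from the order of days to as little as 20 minutes.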

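Key design: For task design, the paper adapts the OSWorld framework to create 150+ Windows tasks that exercise planning, screen understanding, and tool use. For agent design, it proposes the multi-modal agent Navi. Navi's architecture and parameter settings are not given in this summary, but the reported experiments cover both Windows tasks and the web benchmark Mind2Web.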
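To make the notion of a task definition concrete, here is a minimal Python sketch of a task paired with a success check. It assumes that success is judged from the final state of the environment, which this summary does not spell out. `TaskSpec`, `file_contains`, and the dictionary-based `State` are hypothetical and do not reflect the benchmark's real task schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Simulated "final OS state" (assumption): file path -> file contents.
# A real checker would inspect the actual VM (files, settings, app state).
State = Dict[str, str]


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task definition: an instruction plus a success check."""
    instruction: str
    domain: str
    check: Callable[[State], bool]


def file_contains(path: str, text: str) -> Callable[[State], bool]:
    """Build a checker that passes when `path` exists and contains `text`."""
    def _check(state: State) -> bool:
        return text in state.get(path, "")
    return _check


task = TaskSpec(
    instruction="Create notes.txt on the Desktop containing the word 'agenda'",
    domain="file management",
    check=file_contains("Desktop/notes.txt", "agenda"),
)

# Two possible end states after an agent episode (made up for illustration).
state_ok = {"Desktop/notes.txt": "agenda for Monday"}
state_bad = {"Desktop/todo.txt": "agenda for Monday"}

print(task.check(state_ok), task.check(state_bad))
```

Keeping the check as a function of the end state, rather than of the exact action sequence, lets different agents solve the same task in different ways and still be scored consistently.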

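📊 Experimental Highlights

Experiments show that the proposed multi-modal agent Navi achieves a 19.5% success rate in Windows Agent Arena. This is well below the 74.5% of an unassisted human, but it shows that agents can carry out tasks in a complex operating-system environment. Navi also performs well on the Mind2Web benchmark, which suggests some generality. A full Windows Agent Arena evaluation can finish in as little as 20 minutes when parallelized in Azure, which greatly improves evaluation efficiency.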
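The size of the gap between agent and human can be read off the two reported figures. The short Python snippet below uses only the numbers quoted above; the variable names are illustrative.

```python
# Figures reported in the abstract (percent success rate).
navi_success = 19.5
human_success = 74.5

# Absolute gap in percentage points.
gap_points = human_success - navi_success

# Agent success expressed as a fraction of human success.
relative = navi_success / human_success

print(f"gap: {gap_points:.1f} percentage points")
print(f"agent reaches {relative:.1%} of human success")
```

In other words, Navi trails the unassisted human by 55 percentage points and solves roughly a quarter as many tasks.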

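🎯 Application Scenarios

This work can support the development of smarter, more general computer agents that raise human productivity across tasks. For example, agents could automate repetitive work, assist users with complex operations, or even carry out tasks while the user is away. The work also helps advance agent technology for software accessibility, making computers easier to use for more people.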

📄 Abstract (original)

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena
