SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

📄 arXiv: 2508.20018v1

Authors: Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo

Categories: cs.AI, cs.CL, cs.CV, cs.MA

Published: 2025-08-27

Comments: 28 pages, 12 figures


💡 One-Sentence Takeaway

Proposes SWIRL to address multi-agent reinforcement learning for mobile GUI control.

🎯 Matched Area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)

Keywords: mobile GUI control, multi-agent systems, reinforcement learning, vision-language models, agent coordination, mathematical reasoning, experimental validation

📋 Key Points

  1. Existing single-agent approaches to mobile GUI control are limited by structural constraints and cannot effectively handle complex interface operations.
  2. SWIRL reformulates multi-agent reinforcement learning as a sequence of single-agent tasks, updating one agent at a time, which enables stable training and efficient coordination.
  3. Experiments show that SWIRL performs strongly on both high-level and low-level GUI benchmarks and demonstrates strong capability in multi-agent mathematical reasoning.

🔬 Method Details

Problem definition: The paper targets the inefficiency of multi-agent reinforcement learning in mobile GUI control; existing methods suffer from structural constraints and coordination difficulties when handling complex tasks.

Core idea: SWIRL reformulates multi-agent reinforcement learning as a sequence of single-agent RL tasks, updating one agent at a time while keeping the others fixed, which yields a stable training process.
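
For intuition, here is a minimal Python sketch of this interleaved update schedule, assuming a generic policy-gradient setup. The helper names (`set_trainable`, `collect_trajectories`, `policy_gradient_step`) are hypothetical placeholders for exposition, not the paper's actual API.

```python
# A minimal sketch of SWIRL-style interleaved training (hypothetical API).
from typing import Callable, List, Sequence


def swirl_train(
    agents: List[object],
    collect_trajectories: Callable[[Sequence[object]], list],
    policy_gradient_step: Callable[[object, list], None],
    num_rounds: int = 3,
    steps_per_stage: int = 100,
) -> List[object]:
    """Interleaved RL: MARL reformulated as a sequence of single-agent stages.

    Each stage trains exactly one agent while all others stay frozen,
    so every stage reduces to an ordinary single-agent RL problem.
    """
    for _ in range(num_rounds):
        for i, trainable_agent in enumerate(agents):
            # Freeze every agent except the one updated in this stage.
            for j, agent in enumerate(agents):
                agent.set_trainable(j == i)  # hypothetical freeze/unfreeze hook
            for _ in range(steps_per_stage):
                # Rollouts run the full multi-agent workflow; frozen agents
                # still act during data collection but receive no gradient.
                trajectories = collect_trajectories(agents)
                policy_gradient_step(trainable_agent, trajectories)
    return agents
```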

Technical framework: SWIRL's overall architecture comprises two modules, a Navigator and an Interactor. The Navigator converts natural language and screen context into structured plans, while the Interactor grounds those plans into executable atomic actions.
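
A rough sketch of this two-module decomposition is below, assuming simple dataclass interfaces; the `Plan` and `AtomicAction` fields and the stubbed model calls are illustrative assumptions, not SWIRL's real message format.

```python
# Illustrative Navigator -> Interactor pipeline (fields are assumptions).
from dataclasses import dataclass


@dataclass
class Plan:
    """Structured plan produced by the Navigator from language + screen."""
    subgoal: str      # e.g., "open the Wi-Fi settings page"
    rationale: str    # brief reasoning behind the subgoal


@dataclass
class AtomicAction:
    """Executable GUI primitive produced by the Interactor."""
    kind: str         # e.g., "tap", "scroll", "type"
    target: str       # e.g., screen coordinates or text to enter


def navigator(instruction: str, screenshot: bytes) -> Plan:
    """LVLM call that converts the instruction and screen into a plan."""
    raise NotImplementedError  # stands in for the Navigator model


def interactor(plan: Plan, screenshot: bytes) -> AtomicAction:
    """LVLM call that grounds the plan into an atomic action."""
    raise NotImplementedError  # stands in for the Interactor model


def step(instruction: str, screenshot: bytes) -> AtomicAction:
    """One decision step of the two-agent workflow."""
    plan = navigator(instruction, screenshot)
    return interactor(plan, screenshot)
```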

Key innovation: SWIRL's main innovation is this reformulation of MARL as a sequence of single-agent tasks, which improves training stability and efficiency and coordinates interaction between agents better than conventional multi-agent methods.

Key design: SWIRL is supported by a theoretical framework consisting of a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization.
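
Schematically, and only as a hedged illustration (the precise hypotheses, constants, and proofs are in the paper), these guarantees take roughly the following shape:

```latex
% Schematic forms only; exact conditions and constants are in the paper.
% Let J(\pi_1,\dots,\pi_n) denote the expected return of the joint policy.

% Stepwise safety bound: updating agent i while the others stay fixed
% cannot degrade the return by more than a controlled slack \epsilon:
J(\pi_i', \pi_{-i}) \ge J(\pi_i, \pi_{-i}) - \epsilon

% Cross-round monotonic improvement: the joint return is non-decreasing
% across rounds r, so the bounded return sequence converges:
J(\pi^{(r+1)}) \ge J(\pi^{(r)}), \qquad \lim_{r \to \infty} J(\pi^{(r)}) \text{ exists.}
```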

📊 Experimental Highlights

SWIRL performs strongly on both high-level and low-level GUI benchmarks, with results showing it outperforms existing baselines (specific improvement figures are not given in this summary). It also demonstrates strong capability on multi-agent mathematical reasoning tasks, further validating its effectiveness.

🎯 Application Scenarios

SWIRL shows strong application potential for mobile GUI control, reliably translating natural language into interface operations and improving user experience. Its performance on multi-agent mathematical reasoning further suggests it can serve as a general framework for building other multi-agent systems.

📄 Abstract

The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.