AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent

Authors: Jingru Fan, Yufan Dang, Jingyao Wu, Huatao Li, Runde Yang, Xiyuan Yang, Yuheng Wang, Chen Qian

Categories: cs.AI, cs.CL, cs.CV, cs.HC

Published: 2025-09-02 (updated: 2025-10-17)

Note: Project at https://github.com/OpenBMB/AppCopilot

💡 One-Sentence Summary

AppCopilot: a general, accurate, long-horizon, and efficient mobile agent

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: mobile agent, multimodal model, multi-agent collaboration, hierarchical task planning, artificial general intelligence

📋 Key Points

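Existing mobile agents fall short in cross-task generalization, on-screen interaction accuracy, long-horizon task handling, and efficiency on resource-constrained devices.
AppCopilot combines multimodal models, multi-agent collaboration, and hierarchical task planning to build a general, accurate, long-horizon, and efficient mobile agent.
According to the paper, AppCopilot improves significantly in generalization, accuracy, long-horizon task completion, and runtime efficiency.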

📝 Abstract (translated from Chinese)

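With the rapid development of large language models and multimodal models, the mobile-agent field has flourished, yet its fundamental challenges remain unsolved. This paper identifies four core problems that mobile agents must solve to deliver practical, scalable impact: (1) generalization across tasks, apps, and devices; (2) accuracy, in particular precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, in particular a high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose mobile agent that works across applications. AppCopilot realizes this through an end-to-end pipeline covering data collection, training, fine-tuning, efficient inference, and PC/mobile applications. At the model layer, it integrates multimodal foundation models with strong Chinese and English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it provides experiential adaptation, voice interaction, function calling, cross-app and cross-device orchestration, and comprehensive mobile app support. The system design incorporates profiling-driven optimization of latency and memory across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements along four dimensions: stronger generalization, higher precision of on-screen actions, more reliable long-horizon task completion, and a faster, more resource-efficient runtime. By articulating a coherent position and a closed-loop reference architecture that runs from data collection and training to fine-tuning and efficient inference, the paper offers a concrete roadmap and actionable guidance for general-purpose mobile agents.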

🔬 Method Details

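Problem definition: Existing mobile agents generalize poorly across apps, tasks, and devices, interact with the screen imprecisely, cannot complete complex long-horizon tasks, and run inefficiently on mobile devices. These problems limit their practical value.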

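Core idea: AppCopilot aims to build a general-purpose multimodal mobile agent that combines multimodal foundation models, multi-agent collaboration, and hierarchical task planning to improve generalization, accuracy, long-horizon task handling, and efficiency. The design is meant to mimic how people operate mobile devices, so that the agent can better understand and carry out user intent.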

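Technical framework: AppCopilot's overall architecture spans data collection, training, fine-tuning, efficient inference, and PC/mobile applications. The model layer integrates multimodal foundation models. The reasoning and control layer combines chain-of-thought reasoning, hierarchical task planning, and multi-agent collaboration. The execution layer provides experiential adaptation, voice interaction, function calling, and cross-app and cross-device orchestration. The system design uses profiling-driven optimization to reduce latency and memory use across heterogeneous hardware.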
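The layering described above can be pictured as three stages applied in order to each observation. The following is only a minimal sketch under stated assumptions: the `Action` type, the `run_step` function, and the three stub layers are hypothetical names invented for illustration and are not taken from the AppCopilot code base.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """A single UI action; the fields are illustrative, not the paper's schema."""
    kind: str    # e.g. "tap" or "type"
    target: str  # label of the UI element to act on


# Each layer is modeled as a plain function so the stubs can be swapped out.
Perceive = Callable[[bytes], List[str]]          # model layer: screenshot -> element labels
Plan = Callable[[str, List[str]], List[Action]]  # reasoning/control layer: goal + elements -> actions
Execute = Callable[[Action], bool]               # execution layer: action -> success flag


def run_step(goal: str, screenshot: bytes,
             perceive: Perceive, plan: Plan, execute: Execute) -> List[bool]:
    """Pass one observation through the three layers in order."""
    elements = perceive(screenshot)
    actions = plan(goal, elements)
    return [execute(a) for a in actions]


# Hypothetical stubs standing in for a multimodal model, a planner, and a device driver.
def stub_perceive(screenshot: bytes) -> List[str]:
    return ["search_box", "send_button"]


def stub_plan(goal: str, elements: List[str]) -> List[Action]:
    # Tap every visible element whose label appears in the goal text.
    return [Action("tap", e) for e in elements if e in goal]


def stub_execute(action: Action) -> bool:
    return action.kind in {"tap", "type"}


results = run_step("tap send_button", b"", stub_perceive, stub_plan, stub_execute)
```

Keeping each layer behind a function signature is one possible way to let a model, planner, or device backend be replaced independently; the paper may structure this differently.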

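Key innovation: AppCopilot's main contribution is its end-to-end system design and its fusion of multimodal and multi-agent techniques. It goes beyond model-level optimization to cover the design of reasoning and control strategies and the extension of execution-layer capabilities. This integrated approach is what lets AppCopilot address the range of challenges mobile agents face.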

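Key design: At the model layer, AppCopilot uses multimodal foundation models that support Chinese and English; the specific models are not identified here. At the reasoning and control layer, it uses hierarchical task planning, which decomposes a complex task into sub-tasks and assigns them to different agents. At the execution layer, an experiential adaptation mechanism lets the agent adjust its operating strategy based on past experience. Details such as parameter settings, loss functions, and network architecture are not given here.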
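As a rough illustration of decomposing a task into sub-tasks, dispatching them to different agents, and reusing past experience, consider the sketch below. Everything in it is an assumption made for illustration: the `role: task` goal syntax, the `decompose`, `make_agent`, and `run` helpers, and the dictionary used as an experience store are hypothetical and do not reflect AppCopilot's actual planner or memory.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical experience store: maps a sub-task to a strategy remembered from earlier runs.
experience: Dict[str, str] = {}


def decompose(goal: str) -> List[Tuple[str, str]]:
    """Split a goal into (agent_role, sub_task) pairs.

    A real planner would query a model; here sub-tasks are separated by ';'
    and each one is written as 'role: task'.
    """
    pairs = []
    for part in goal.split(";"):
        role, task = part.split(":", 1)
        pairs.append((role.strip(), task.strip()))
    return pairs


def make_agent(role: str) -> Callable[[str], str]:
    """Return a stub agent that reuses a remembered strategy when one exists."""
    def agent(task: str) -> str:
        strategy = experience.get(task, "default")
        experience[task] = "cached"  # record that this sub-task has been seen
        return f"{role} did '{task}' using {strategy} strategy"
    return agent


def run(goal: str, agents: Dict[str, Callable[[str], str]]) -> List[str]:
    """Dispatch each sub-task to the agent registered for its role, in order."""
    return [agents[role](task) for role, task in decompose(goal)]


agents = {"browser": make_agent("browser"), "messenger": make_agent("messenger")}
goal = "browser: find the address; messenger: send the address"
first = run(goal, agents)   # no experience yet, so agents fall back to the default strategy
second = run(goal, agents)  # the same sub-tasks now hit the experience store
```

The second run differs from the first only because the experience store was populated, which is the behavior the adaptation mechanism is described as providing at a much larger scale.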

📊 Experimental Highlights

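AppCopilot is reported to improve significantly along four dimensions: generalization, on-screen action precision, long-horizon task completion, and runtime efficiency. Specific performance figures and comparison baselines are not given here, but the paper emphasizes gains on every metric, which suggests AppCopilot is competitive in the mobile-agent field.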

🎯 Application Scenarios

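AppCopilot can be applied to automated mobile app testing, intelligent assistants, and automated operation of mobile devices. It can help users complete a wide range of on-device tasks automatically, which improves productivity and makes devices easier to operate. In the future, AppCopilot could become an important mode of intelligent interaction on mobile devices and help advance mobile-agent technology.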

📄 Abstract (Original)

With the rapid evolution of large language models and multimodal models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that should be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, APPs, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose mobile agent that operates across applications. AppCopilot operationalizes this position through an end-to-end pipeline spanning data collection, training, finetuning, efficient inference, and PC/mobile application. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables experiential adaptation, voice interaction, function calling, cross-APP and cross-device orchestration, and comprehensive mobile APP support. The system design incorporates profiling-driven optimization for latency and memory across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements on four dimensions: stronger generalization, higher precision of on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime. By articulating a cohesive position and a reference architecture that closes the loop from data collection and training to finetuning and efficient inference, this paper offers a concrete roadmap for general-purpose mobile agents and provides actionable guidance.
