AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning
Authors: Bojie Rong, Zheyu Shen, Qiaoping Wang, Pengfei Kang, Yang Xu, Yawen Wei, Hanyu Wu, Zhi Zhao, Leihao Pei, Linquan Jiang
Category: cs.AI
Published: 2026-06-08
💡 One-Sentence Takeaway
Proposes AliyunConsoleAgent to automate documentation verification in real-world cloud consoles
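🎯 Matched Areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture) Pillar 9: Embodied Foundation Models
Keywords: cloud computing, documentation verification, reinforcement learning, distillation training, automated testing, agent systems, policy optimization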
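📋 Key Points
- Existing cloud console documentation verification relies on manual inspection, which is labor-intensive and covers less than 1% of the required checks.
- AliyunConsoleAgent is trained in two stages, supervised fine-tuning on distilled trajectories followed by reinforcement learning, to improve the efficiency and accuracy of documentation verification.
- On a 278-task benchmark, AliyunConsoleAgent-32B achieves a 63.52% mean success rate, 20.24 percentage points above the base model.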
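📝 Abstract (Translated from Chinese)
We present AliyunConsoleAgent, a web agent framework for automated documentation verification that targets discrepancies between documentation and user interfaces in real-world cloud consoles. Major cloud platforms span many products with rapid feature iteration, so console UIs frequently diverge from their corresponding documentation. Verification requires an estimated 4 million inspections per year, yet manual coverage remains below 1%. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning in real cloud environments using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding.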
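🔬 Method Details
Problem definition: The paper addresses inconsistencies between cloud console documentation and the actual console behavior. Existing verification relies on manual inspection, which is inefficient and leaves most procedures unchecked.
Core idea: We propose a two-stage training framework: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning in real cloud environments to strengthen the agent's decision-making.
Technical framework: The pipeline has two main stages, supervised fine-tuning and then reinforcement learning. We also build a high-determinism rollout system to support large-scale RL training.
Key innovation: Reinforcement learning with Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model, which substantially improves the agent's autonomous decision-making.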
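To make the "group relative" part of GRPO concrete, the sketch below shows the reward normalization commonly used with it: several rollouts are sampled for the same task, and each rollout's reward is standardized against the mean and standard deviation of its own group. This is a minimal illustration and not the authors' implementation; the function name, the `eps` constant, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward against its own sampling group.

    rewards: outcome rewards for rollouts sampled from the same task.
    eps: hypothetical small constant that avoids division by zero when
         every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two succeeded (1.0), two failed (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful rollouts receive a positive advantage and failed ones a negative advantage. When all rollouts in a group get the same reward, every advantage is zero and that group contributes no learning signal.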
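Key design: Terraform-based resource pre-provisioning, combined with LLM-driven on-demand provisioning, isolates environment noise from the training signal. In addition, a rule-based reward evaluation protocol grounded in backend audit logs provides objective, reward-hacking-resistant outcome judgment.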
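The idea of a rule-based reward grounded in backend audit logs can be sketched as follows: rather than trusting what the agent reports or what the page displays, the evaluator checks whether the required operations were actually recorded as successful on the backend. The source does not describe the log schema or the rules, so the `AuditEvent` fields, the action names, and the binary reward are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit-log record; the real schema is not given in the source."""
    action: str       # illustrative API action name, e.g. "CreateInstance"
    resource: str     # illustrative resource identifier
    succeeded: bool   # whether the backend recorded the call as successful


def rule_based_reward(events, required_actions):
    """Return 1.0 only if every required action appears as a successful event."""
    completed = {e.action for e in events if e.succeeded}
    return 1.0 if set(required_actions) <= completed else 0.0


# Illustrative log: the instance was created, but the disk attach failed.
log = [
    AuditEvent("CreateInstance", "i-example", True),
    AuditEvent("AttachDisk", "d-example", False),
]
print(rule_based_reward(log, ["CreateInstance"]))
print(rule_based_reward(log, ["CreateInstance", "AttachDisk"]))
```

Because the reward depends only on recorded backend state, an agent cannot earn it by merely claiming success in its final message, which is the intuition behind the reward-hacking resistance described above.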
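📊 Experimental Highlights
On the 278-task benchmark, AliyunConsoleAgent-32B reaches a 63.52% mean success rate, a 20.24 percentage-point improvement over the base model. The gap to the best frontier model narrows to 1.82 percentage points, at 92% lower inference cost.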
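The reported figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the abstract (the 65.34% best frontier result comes from the original abstract); the implied base-model success rate is derived here and is not a number the source states directly.

```python
from decimal import Decimal

agent = Decimal("63.52")           # AliyunConsoleAgent-32B mean success rate (%)
gain_over_base = Decimal("20.24")  # reported improvement over the base model (pp)
best_frontier = Decimal("65.34")   # best frontier model, per the original abstract (%)

implied_base = agent - gain_over_base           # derived, not stated in the source
gap_to_frontier = best_frontier - agent         # should match the reported 1.82 pp
relative_cost = Decimal("1") - Decimal("0.92")  # fraction of frontier inference cost

print(implied_base)     # 43.28
print(gap_to_frontier)  # 1.82
print(relative_cost)    # 0.08
```

The derived gap agrees with the reported 1.82 percentage points, and a 92% cost reduction means the model runs at roughly 8% of the frontier model's inference cost.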
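🎯 Application Scenarios
Potential applications include documentation verification, automated testing, and user support for cloud service providers. By keeping documentation consistent with actual console behavior, AliyunConsoleAgent can substantially reduce manual effort and improve the user experience, and the approach may extend to other cloud platforms.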
📄 Abstract (Original)
We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.