Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

Authors: Dayuan Fu, Yunze Wu, Xiaojie Cai, Lyumanshan Ye, Shijie Xia, Zhen Huang, Weiye Si, Tianze Xu, Jie Sun, Keyu Li, Mohan Jiang, Junfei Wang, Qishuo Hua, Pengrui Lu, Yang Xiao, Pengfei Liu

Categories: cs.AI

Published: 2025-10-31 (updated: 2025-11-03)

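💡 One-Sentence Summary

Proposes Apollo, a framework that uses asynchronous human-agent interaction to improve the training of LLM agents on long-horizon tasks.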

🎯 Matched areas: Pillar 1: Robot Control; Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: human-agent interaction, long-horizon tasks, LLM agent, asynchronous training, behavior cloning

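📋 Key Points

Existing training methods for long-horizon tasks rely on either dense annotation or outcome-driven sampling; the former is prohibitively expensive, and the latter is prone to collapse.
Apollo uses asynchronous human-agent interaction: annotators step in to guide the agent only when it drifts from a promising trajectory, which lowers annotation cost.
Experiments show that when training GLM-4.5 on InnovatorBench, Apollo improves on the untrained baseline by more than 50% and on a variant trained without human interaction by 28%.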

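📝 Abstract (Translated)

Large Language Model (LLM) agents show great potential in areas such as automated programming, deep research, and graphical user interface operation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods fall into two main categories. The first relies on dense behavior cloning, which is too expensive for long-horizon tasks that take days or even months. The second relies on outcome-driven sampling, which often causes training to collapse because valid positive trajectories are scarce on domain-specialized tasks. We propose Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Apollo lets annotators intervene only when the agent drifts from a promising trajectory, providing prior knowledge and strategic advice without following the agent the whole way. This lightweight design makes it possible to sustain interactions for more than 30 hours and yields valuable trajectories at lower cost. Apollo also applies supervision control to filter out sub-optimal actions and prevent error propagation. Experiments show that when training the GLM-4.5 model on InnovatorBench, Apollo improves on the untrained baseline by more than 50% and on a variant without human interaction by 28%.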

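🔬 Method Details

Problem definition: The paper addresses the difficulty of training LLM agents on long-horizon, domain-specialized tasks. Among existing methods, behavior cloning requires extensive human annotation and is too costly, while outcome-driven sampling is prone to training collapse because positive trajectories are sparse.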
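
The collapse of outcome-driven sampling can be illustrated with a back-of-the-envelope calculation. This is a sketch under stated assumptions, not an analysis taken from the paper: suppose each rollout succeeds independently with probability $p$ and $N$ rollouts are sampled (both symbols are introduced here purely for illustration).

```latex
\mathbb{E}[\text{positive trajectories}] = pN,
\qquad
\Pr[\text{no positive trajectory}] = (1 - p)^{N}
```

With the illustrative values $p = 0.01$ and $N = 50$, the expected number of positive trajectories is $0.5$ and the probability of obtaining none is $0.99^{50} \approx 0.61$, so most sampling rounds would yield no usable positive example.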

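Core idea: The paper introduces asynchronous human-agent interaction. Human experts can intervene when the agent is performing poorly, offering guidance and advice so that the agent explores the environment more effectively and produces high-quality training data. This lowers annotation cost while avoiding the failures that can result from relying entirely on the agent's autonomous exploration.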

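Technical framework: Apollo has two main components: asynchronous human-agent interaction and action-level data filtering. First, the agent explores the environment autonomously; when it drifts from a promising trajectory, a human expert can step in with prior knowledge, strategic advice, and similar input. Second, Apollo applies supervision control to filter out sub-optimal actions and prevent error propagation. The overall pipeline aims to collect high-quality training data at low cost.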
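
The asynchronous part of this loop can be sketched as follows. This is a minimal illustration, not Apollo's implementation: it assumes guidance arrives through an in-memory queue that the agent polls without blocking, and the names `run_rollout`, `toy_policy`, and `human_checks_in` are hypothetical.

```python
import queue
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    index: int
    action: str
    guidance: Optional[str] = None  # human note delivered at this step, if any


@dataclass
class Rollout:
    steps: List[Step] = field(default_factory=list)


def run_rollout(
    policy: Callable[[List[str]], str],
    guidance_inbox: "queue.Queue[str]",
    max_steps: int,
    on_step: Optional[Callable[[int], None]] = None,
) -> Rollout:
    """Run the agent step by step without ever blocking on the human."""
    context: List[str] = []
    rollout = Rollout()
    for i in range(max_steps):
        if on_step is not None:
            on_step(i)  # hook standing in for an annotator who checks in occasionally
        # Non-blocking poll: the agent proceeds whether or not guidance has arrived.
        try:
            note = guidance_inbox.get_nowait()
        except queue.Empty:
            note = None
        if note is not None:
            context.append(f"[human] {note}")
        action = policy(context)
        context.append(f"[agent] {action}")
        rollout.steps.append(Step(index=i, action=action, guidance=note))
    return rollout


def toy_policy(context: List[str]) -> str:
    """Hypothetical policy: explores until a human note exists, then follows the latest one."""
    notes = [c for c in context if c.startswith("[human] ")]
    if not notes:
        return "explore"
    return "follow: " + notes[-1][len("[human] "):]


inbox: "queue.Queue[str]" = queue.Queue()


def human_checks_in(step: int) -> None:
    # The simulated annotator intervenes exactly once, at step 2.
    if step == 2:
        inbox.put("reuse the baseline training script")


result = run_rollout(toy_policy, inbox, max_steps=4, on_step=human_checks_in)
actions = [s.action for s in result.steps]
```

In this toy run the agent explores for two steps on its own and changes course only after the single note arrives, which mirrors the idea that the annotator does not have to follow every step.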

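Key innovation: Apollo's key innovation is its asynchronous mode of human-agent interaction. Unlike traditional behavior cloning, Apollo does not require a human expert to follow the agent throughout; the expert intervenes only at critical moments. This significantly lowers annotation cost and makes long-running interactions feasible.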

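Key design: Apollo's key design considerations are: 1) when to intervene, which requires sound criteria for judging whether the agent has drifted from a promising trajectory; 2) how to intervene, which requires effective tools and interfaces that let experts give guidance and advice easily; 3) action-level data filtering, which requires a suitable algorithm for identifying and filtering out sub-optimal actions to prevent error propagation. Specific parameter settings, loss functions, and network architecture may not be described in detail in the paper and are unknown here.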
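
One way to picture action-level filtering is shown below. This summary does not describe the paper's actual filtering algorithm, so the sketch assumes each action already carries a hypothetical `suboptimal` flag (for example from a reviewer), and it assumes that flagged actions stay in the history as context but are never used as training targets.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Action:
    text: str
    suboptimal: bool  # hypothetical label; how it is produced is not specified here


def build_training_examples(trajectory: List[Action]) -> List[Dict[str, object]]:
    """Return one (context, target) pair per action that passes the filter.

    Flagged actions remain in the context so later steps keep their full
    history, but they are never emitted as targets to imitate.
    """
    examples: List[Dict[str, object]] = []
    history: List[str] = []
    for action in trajectory:
        if not action.suboptimal:
            examples.append({"context": list(history), "target": action.text})
        history.append(action.text)
    return examples


trajectory = [
    Action("read task description", suboptimal=False),
    Action("delete the dataset cache", suboptimal=True),
    Action("rerun preprocessing", suboptimal=False),
]
examples = build_training_examples(trajectory)
targets = [e["target"] for e in examples]
```

Filtering at the action level rather than discarding whole trajectories keeps the useful steps of a long rollout even when a few of its actions were mistakes.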

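📊 Experimental Highlights

When training the GLM-4.5 model on InnovatorBench, an agent trained with Apollo improves on the untrained baseline by more than 50% and on a variant without human interaction by 28%. This indicates that Apollo can significantly improve LLM agent performance on long-horizon, domain-specialized tasks.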

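🎯 Application Scenarios

Apollo can be applied to tasks that require long-horizon decision-making and domain expertise, such as automated software development, complex system control, and scientific research. Human-agent collaboration makes it possible to train LLM agents more effectively for these complex tasks, which in turn can raise productivity and support innovation.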

📄 Abstract (Original)

Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.
