TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Authors: Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji
Categories: cs.LG, cs.AI, cs.CL
Published: 2026-06-09
Comments: 32 pages, 12 figures, 6 tables
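💡 One-Sentence Takeaway
Proposes TRACE, a unified framework that allocates rollout budget across prompt roots and turn-level prefixes to strengthen reward contrast in multi-turn agentic reinforcement learning.
🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models
Keywords: reinforcement learning, reward contrast, tree structure, multi-turn dialogue, intelligent decision-making, model optimization, budget allocation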
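📋 Key Points
- Existing methods for multi-turn reinforcement learning suffer from insufficient reward contrast, which weakens policy optimization.
- The paper proposes TRACE, which allocates rollout budget over a tree-structured set of rollouts to increase reward contrast.
- Experiments show that TRACE is competitive on typical agentic benchmarks, e.g., it improves Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.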
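📝 Abstract (Summary)
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, policy optimization is often limited by insufficient reward contrast: overly simple or overly complex prompts produce low-variance feedback, and in multi-turn rollouts an outcome-only reward assigns the same terminal assessment to every decision. The paper proposes TRACE (Tree Rollout Allocation for Contrastive Exploration), which models each thought-action-observation turn as a semantically distinct node and extends budget allocation from prompt roots to intermediate prefixes, so that rollouts form a tree. TRACE strengthens reward contrast within a fixed sampling budget. Experiments show competitive performance on typical agentic benchmarks, including a 2.8-point gain in Qwen3-14B Multi-Hop QA average accuracy.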
🔬 Method Details
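Problem definition: The paper addresses inefficient policy optimization caused by insufficient reward contrast in multi-turn reinforcement learning. Existing methods exploit sample informativeness only at the prompt level and neglect how prefix-level informativeness varies across turns within the same rollout.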
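To make the "insufficient reward contrast" problem concrete, here is a minimal sketch. It assumes a group-relative baseline (each rollout's reward minus the mean reward of its group); this estimator and the name `group_advantages` are illustrative assumptions, not taken from the paper. When every rollout in a group receives the same terminal reward, all advantages are zero and the group contributes no update signal.

```python
from statistics import fmean

def group_advantages(rewards):
    """Center each terminal reward on its group's mean.

    Illustrative baseline only; the paper's actual estimator may differ.
    """
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]

# An overly easy prompt: every rollout succeeds, so there is no contrast
# and every advantage is zero.
easy_prompt = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: successes and failures get opposite-signed advantages.
mixed_prompt = group_advantages([1.0, 0.0, 1.0, 0.0])

print(easy_prompt)
print(mixed_prompt)
```

The same reasoning applies to an overly hard prompt whose rollouts all fail: the group mean equals every reward, so the advantages vanish.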
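Core idea: TRACE treats each thought-action-observation turn as a semantically distinct node. This lets budget allocation extend from prompt roots to turn-level prefixes, from which further continuations are sampled. The rollouts therefore form a tree, which increases reward contrast.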
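The tree-structured rollout described above could be represented roughly as follows. This is a sketch under stated assumptions: `TurnNode`, `leaf_rewards`, and `has_contrast` are hypothetical names, and the paper's actual data structures are not given in this summary. Each node holds one thought-action-observation turn; branching at a node corresponds to sampling several continuations from the same prefix.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TurnNode:
    """One thought-action-observation turn (hypothetical representation)."""
    thought: str = ""
    action: str = ""
    observation: str = ""
    reward: Optional[float] = None  # set only on terminal turns
    children: List["TurnNode"] = field(default_factory=list)

    def add_child(self, child: "TurnNode") -> "TurnNode":
        self.children.append(child)
        return child

def leaf_rewards(node: TurnNode) -> List[float]:
    """Collect the terminal rewards of all rollouts passing through `node`."""
    if not node.children:
        return [] if node.reward is None else [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards

def has_contrast(node: TurnNode) -> bool:
    """True if continuations from this prefix end in different rewards."""
    return len(set(leaf_rewards(node))) > 1

# The prompt root is modeled as a node whose observation is the question.
root = TurnNode(observation="question")
search = root.add_child(TurnNode(thought="look it up", action="search"))
# Two continuations sampled from the same prefix, with different outcomes.
search.add_child(TurnNode(action="answer A", reward=1.0))
search.add_child(TurnNode(action="answer B", reward=0.0))
# A second branch from the root with a single failed continuation.
guess = root.add_child(TurnNode(thought="guess", action="answer C", reward=0.0))
```

In this toy tree, the `search` prefix yields mixed terminal rewards, so the turns below it can be compared against each other even though the reward is outcome-only.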
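Technical framework: TRACE consists of a budget allocation module and a shared predictor. The allocation module assigns rollout budget to the prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. The shared predictor estimates the conditional success probability at each of these anchors from its prefix history, and this estimate guides the allocation.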
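One way to turn a predicted success probability into an allocation decision is sketched below. It is not the paper's algorithm: the scoring rule, the greedy selection, and the names `mixed_outcome_prob` and `allocate_budget` are assumptions made for illustration. If an anchor has success probability p and receives k independent continuations, the chance that their terminal rewards are mixed (at least one success and one failure) is 1 - p^k - (1 - p)^k, which peaks at p = 0.5.

```python
def mixed_outcome_prob(p, k):
    """Probability that k independent continuations with success
    probability p yield at least one success and one failure."""
    return 1.0 - p**k - (1.0 - p)**k

def allocate_budget(success_probs, budget, k=2):
    """Greedy sketch (an assumption, not the paper's rule): give k
    continuations to the anchors most likely to produce mixed rewards
    until the rollout budget runs out."""
    ranked = sorted(
        success_probs,
        key=lambda anchor: mixed_outcome_prob(success_probs[anchor], k),
        reverse=True,
    )
    allocation = {}
    remaining = budget
    for anchor in ranked:
        if remaining < k:
            break
        allocation[anchor] = k
        remaining -= k
    return allocation

# Hypothetical predictor outputs for one prompt root and two prefixes.
predicted = {"root": 0.95, "prefix_a": 0.5, "prefix_b": 0.1}
plan = allocate_budget(predicted, budget=4, k=2)
print(plan)
```

With these made-up probabilities, the near-certain root is skipped and the budget goes to the two prefixes whose outcomes are least predictable.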
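Key innovation: The adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. This contrasts with the linear rollout structure of conventional methods.
Key design: The main design elements are a shared predictor that estimates conditional success probability from prefix histories, and a dynamic budget allocation strategy that operates within a fixed sampling budget.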
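📊 Experimental Highlights
TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks. In particular, it improves Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.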
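🎯 Application Scenarios
TRACE targets multi-turn reinforcement learning settings that require complex reasoning and decision-making, such as dialogue systems, intelligent assistants, and automated question answering. By allocating rollout budget more effectively, it may improve a model's decision quality and response efficiency.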
📄 Abstract (Original)
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.