A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Authors: Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang

Category: cs.CL

Published: 2026-05-07

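💡 One-Sentence Takeaway

Proposes A$^2$TGPO, an algorithm that uses adaptive turn-level clipping to improve process reward assignment in reinforcement learning for agentic LLMs.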

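🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: reinforcement learning, agentic LLMs, process credit assignment, information gain, policy optimization, multi-turn interaction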

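📋 Key Points

Existing methods struggle with process credit assignment in multi-turn agent interactions: external reward models are costly, and tree-structured rollouts restrict the exploration space.
The paper proposes A$^2$TGPO, which refines how information-gain-based intrinsic rewards are assigned through turn-group normalization, variance-rescaled accumulation, and adaptive clipping.
While preserving training efficiency, the method markedly improves agents' decision-making on complex multi-step tasks and outperforms existing baseline RL algorithms.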

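📝 Abstract (Translated from Chinese)

Reinforcement learning for agentic large language models (LLMs) usually relies on sparse, trajectory-level outcome rewards, which makes it hard to assess the contribution of an individual tool call within a multi-turn interaction. Existing process credit assignment methods either depend on expensive external process reward models or use tree-structured rollouts that limit trajectory diversity. An alternative is to use the change in the policy's predicted probability of the ground truth, termed information gain (IG), as an intrinsic process signal. Applying IG inside the RL training loop, however, faces three challenges: normalization bias caused by differing positional contexts, advantage magnitudes that drift as terms accumulate with trajectory depth, and a fixed clipping range that cannot adapt to turns carrying different amounts of information. The paper proposes A$^2$TGPO, which addresses these issues through turn-group normalization, variance-rescaled discounted accumulation, and adaptive turn-level clipping, yielding more precise policy optimization.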

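🔬 Method Details

Problem definition: The paper addresses how to use an intrinsic signal, information gain, for efficient process credit assignment in the multi-turn interactions of agentic LLMs. Existing methods fall short in normalizing across turns at different depths, in controlling drift in advantage magnitude, and in applying one uniform clipping policy, which destabilizes training and makes it hard to capture the contribution of key steps.

Core idea: Use information gain (IG) as the intrinsic reward, then align the distributions of turns at the same depth, dynamically adjust the variance of the accumulated advantage, and adapt the clipping range of the policy update to the magnitude of IG, giving finer-grained policy guidance.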
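
To make the IG signal concrete, here is a minimal sketch that computes per-turn IG as the difference between consecutive ground-truth probabilities. The function name, the use of raw probabilities rather than log-probabilities, and the example numbers are assumptions made for illustration; the summary does not specify them.

```python
from typing import List


def information_gain(gt_probs: List[float]) -> List[float]:
    """Per-turn IG as the change in the policy's probability of the ground truth.

    gt_probs[0] is the probability before any turn; gt_probs[t] is the
    probability after turn t. Returns one IG value per turn.
    (Assumption: IG is a plain probability difference.)
    """
    if len(gt_probs) < 2:
        return []
    return [after - before for before, after in zip(gt_probs, gt_probs[1:])]


# Hypothetical 3-turn trajectory: the second tool call helps most,
# and the third slightly lowers the ground-truth probability.
igs = information_gain([0.10, 0.15, 0.60, 0.55])
```

Under this definition the per-turn values sum to the total change in ground-truth probability over the trajectory, so IG can be read as a turn-by-turn breakdown of that change.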

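Technical framework: A$^2$TGPO consists of three core modules: a turn-group normalization module that removes positional bias; a variance-rescaled accumulation module that stabilizes advantage values across depths; and an adaptive clipping module that adjusts the PPO clipping threshold according to how informative each turn is.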
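
The first module can be sketched as follows, assuming the normalization is a z-score taken within each (prompt, turn-index) group. The z-score form, the population standard deviation, the `eps` stabilizer, and the key layout are illustrative assumptions, not details confirmed by the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

TurnKey = Tuple[str, int, int]  # (prompt_id, rollout_id, turn_index), hypothetical layout


def turn_group_normalize(ig: Dict[TurnKey, float], eps: float = 1e-6) -> Dict[TurnKey, float]:
    """Z-score raw IG within each (prompt_id, turn_index) group."""
    # Collect the IG values of all rollouts that share a prompt and a turn index.
    groups: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for (prompt_id, _rollout_id, turn_idx), value in ig.items():
        groups[(prompt_id, turn_idx)].append(value)

    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    normalized: Dict[TurnKey, float] = {}
    for (prompt_id, rollout_id, turn_idx), value in ig.items():
        mu, sigma = stats[(prompt_id, turn_idx)]
        # eps avoids division by zero when every peer has the same IG.
        normalized[(prompt_id, rollout_id, turn_idx)] = (value - mu) / (sigma + eps)
    return normalized


# Two rollouts of one prompt, two turns each (made-up values).
raw = {("q1", 0, 0): 0.1, ("q1", 1, 0): 0.3, ("q1", 0, 1): 0.5, ("q1", 1, 1): 0.5}
norm = turn_group_normalize(raw)
```

Each turn is compared only with turns at the same depth, so a first-turn IG of 0.3 is scored against other first turns rather than against later turns that see a richer context.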

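Key innovation: The central innovation is adaptive turn-level clipping, which ties the flexibility of the policy update directly to a turn's importance (its IG value), so the model updates more fully at key decision points and more conservatively at redundant steps.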
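
One way to picture adaptive turn-level clipping is the sketch below, which plugs a per-turn clip range into the standard PPO clipped surrogate. The `tanh`-based mapping, `base_eps`, and `alpha` are hypothetical choices; the summary only states that the range grows with normalized IG, not which function is used.

```python
import math


def adaptive_clip_range(norm_ig: float, base_eps: float = 0.2, alpha: float = 0.5) -> float:
    """Hypothetical monotone map from normalized IG to a clip range.

    tanh bounds the result to base_eps * (1 - alpha, 1 + alpha).
    """
    return base_eps * (1.0 + alpha * math.tanh(norm_ig))


def clipped_surrogate(ratio: float, advantage: float, eps: float) -> float:
    """Standard PPO clipped objective for a single token."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


wide = adaptive_clip_range(2.0)     # informative turn: wider update region
narrow = adaptive_clip_range(-2.0)  # uninformative turn: narrower update region

# The same probability ratio is left unclipped for the informative turn
# but clipped for the uninformative one.
informative_obj = clipped_surrogate(1.25, 1.0, wide)
uninformative_obj = clipped_surrogate(1.25, 1.0, narrow)
```

With these assumed settings the clip range stays between 0.1 and 0.3, so even a very uninformative turn still receives a bounded, nonzero update.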

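Key design: The key technical details are: normalizing IG within the group of turns that share a prompt and a turn index; rescaling the discounted cumulative reward by the reciprocal of the square root of the number of accumulated terms; and defining a dynamic clipping function that is positively correlated with normalized IG, for fine-grained control over the size of each policy update.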
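
The square-root rescaling can be illustrated with a short sketch. It assumes the accumulation runs from each turn to the end of the trajectory, in the style of a return-to-go; the direction of accumulation, the function name, and the default `gamma` are assumptions rather than details given in the summary.

```python
import math
from typing import List


def rescaled_discounted_accumulation(norm_ig: List[float], gamma: float = 1.0) -> List[float]:
    """For each turn t, sum discounted normalized IG from t to the last turn,
    then divide by the square root of the number of summed terms."""
    advantages = [0.0] * len(norm_ig)
    running = 0.0
    for t in reversed(range(len(norm_ig))):
        running = norm_ig[t] + gamma * running  # discounted sum from turn t onward
        n_terms = len(norm_ig) - t              # number of terms in that sum
        advantages[t] = running / math.sqrt(n_terms)
    return advantages


# Four turns with identical normalized IG: the raw sums would be 4, 3, 2, 1,
# while the rescaled values grow only with the square root of the term count.
adv = rescaled_discounted_accumulation([1.0, 1.0, 1.0, 1.0])
```

The motivation is a standard variance argument: a sum of n independent unit-variance terms has variance n, so dividing by the square root of n brings it back to unit variance when `gamma` is 1, which keeps early and late turns on a comparable scale.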

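📊 Experimental Highlights

Experiments show that A$^2$TGPO performs well on multiple agent benchmarks. Compared with standard PPO and fixed-clipping baselines, it converges faster and reaches a higher task success rate on long-trajectory tasks. In complex tool-calling scenarios in particular, its more precise credit assignment markedly reduces wasted exploration, improving final performance while preserving policy diversity.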

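🎯 Application Scenarios

The work applies to agent systems that require multi-step reasoning and tool calls, such as automated code generation, complex task planning, scientific experiment design, and interactive data analysis. By making process credit assignment more accurate, the method can markedly improve agents' robustness and success rate on long-horizon tasks and reduce reliance on costly human annotation or external reward models, giving it broad prospects for industrial use.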

📄 Abstract (Original)

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A$^2$TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
