How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs
Authors: Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan
Categories: cs.LG, cs.CL
Published: 2026-06-09
Comments: 25 pages, 7 figures, 11 tables. Accepted at ICML 2026
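💡 One-sentence takeaway
Proposes FlowTracer to address token-level credit assignment in reinforcement learning for large language models
🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models
Keywords: reinforcement learning, large language models, information-flow tracing, credit assignment, reasoning tasks, attention mechanism, directed acyclic graph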
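📋 Key points
- Existing reinforcement learning methods for large language models do not distinguish decisive reasoning steps from routine content, which makes credit assignment inaccurate.
- The paper proposes FlowTracer, which builds an attention-induced directed acyclic graph, traces information flow on it, and assigns token credit based on the graph's global structure.
- Experiments show that FlowTracer delivers consistent performance gains across multiple reasoning tasks and focuses the learning signal more precisely on key tokens.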
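📝 Abstract (translated summary)
Token-level credit assignment remains a key obstacle for reinforcement learning in large language models: existing methods typically treat all tokens equally and fail to distinguish decisive reasoning steps from routine formatting or fluent filler. The paper proposes FlowTracer, a reinforcement learning framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph. Nodes correspond to tokens, edge capacities come from aggregated attention weights, and token credit is derived from the global structure of this graph. Edge capacities are reweighted to retain only the influence that can reach the answer region. FlowTracer then extracts the information-flow backbone connecting the question to the answer, revealing high-impact hubs and aggregation checkpoints, and achieves consistent performance gains across multiple reasoning tasks.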
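🔬 Method details
Problem definition: The paper targets token-level credit assignment in large language models. Existing methods tend to ignore the global structure of information propagation, so decisive reasoning steps are conflated with routine content.
Core idea: FlowTracer builds an attention-based directed acyclic graph and traces the reasoning flow toward the target answer, enabling finer-grained credit assignment. The design retains only the influence that can reach the answer region while preserving local flow conservation.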
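As a rough illustration of the graph-construction step, the sketch below turns a stack of causal attention maps into edge capacities on a token DAG. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `attention_to_dag`, the mean over layers and heads as the aggregation rule, and the choice to drop self-attention are all hypothetical, since the summary only says that capacities come from aggregated attention weights.

```python
import numpy as np

def attention_to_dag(attn: np.ndarray) -> np.ndarray:
    """Aggregate attention of shape (layers, heads, T, T) into edge capacities.

    Returns cap with cap[j, i] = capacity of the edge j -> i, where the
    earlier token j feeds the later token i.
    """
    # Assumption: a plain mean over layers and heads as the aggregation.
    agg = attn.mean(axis=(0, 1))   # agg[i, j]: how much token i attends to token j
    # Keep only strictly earlier keys (j < i). Dropping self-attention
    # guarantees that every edge points forward in token order (acyclic).
    agg = np.tril(agg, k=-1)
    # Transpose so rows are source tokens and columns are destination tokens.
    return agg.T

# Toy stand-in for real attention maps: 2 layers, 4 heads, 6 tokens.
rng = np.random.default_rng(0)
raw = np.tril(rng.random((2, 4, 6, 6)))        # causal mask on the last two axes
raw = raw / raw.sum(axis=-1, keepdims=True)    # rows sum to 1, like softmax output
cap = attention_to_dag(raw)
```

Because tokens are already in generation order, the strictly upper-triangular `cap` is a DAG in topological order, so later steps can sweep over tokens with a single forward or backward loop.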
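Technical framework: The FlowTracer pipeline consists of building the directed acyclic graph, reweighting edge capacities, extracting the information-flow backbone, and scoring tokens by flow throughput. Its main modules are information-flow tracing, credit assignment, and reward-signal generation.
Key innovation: FlowTracer assigns token credit from the global structure of information propagation, in contrast to existing point-wise heuristics, which lets it better capture long-range dependencies.
Key design: Edge capacities, which come from aggregated attention weights, are reweighted so that only the influence that can reach the answer region is kept. Local flow conservation is enforced so that intermediate tokens neither lose nor gain effective mass because of path length or irrelevant branches.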
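One way to make the reweighting and conservation idea concrete is sketched below. It is an assumption-laden illustration, not the paper's formula: the function `answer_targeted_flow`, the backward "reach" recursion, and the normalization `cap[j, i] * reach[i] / reach[j]` are choices made here because they satisfy both stated properties (only answer-reaching influence survives, and every intermediate token passes on exactly the mass it receives).

```python
import numpy as np

def answer_targeted_flow(cap, question_idx, answer_idx):
    """cap[j, i]: capacity of edge j -> i with j < i (strictly upper triangular)."""
    T = cap.shape[0]
    is_answer = np.zeros(T, dtype=bool)
    is_answer[answer_idx] = True

    # reach[j]: total weight of paths from token j into the answer region.
    # Answer tokens are absorbing and get reach 1; sweep backward over tokens.
    reach = np.zeros(T)
    for j in range(T - 1, -1, -1):
        reach[j] = 1.0 if is_answer[j] else cap[j] @ reach

    # Reweight edges so only answer-reaching influence remains. Each live
    # non-answer row sums to 1, so whatever flows into a token flows out again.
    trans = np.zeros_like(cap)
    live = (reach > 0) & ~is_answer
    trans[live] = cap[live] * reach[None, :] / reach[live][:, None]

    # Inject unit mass over the question tokens and push it forward.
    flow = np.zeros(T)
    flow[question_idx] = 1.0 / len(question_idx)
    for i in range(T):
        flow[i] += flow[:i] @ trans[:i, i]
    return trans, flow

# Toy graph: 8 tokens, tokens 0-1 are the question, tokens 6-7 the answer.
rng = np.random.default_rng(0)
T = 8
cap = np.triu(rng.random((T, T)), k=1)
question_idx, answer_idx = [0, 1], [6, 7]
trans, flow = answer_targeted_flow(cap, question_idx, answer_idx)
```

In this sketch `flow[k]` plays the role of a throughput score for token `k`: tokens that many answer-reaching paths pass through accumulate more mass, while branches that never reach the answer receive none. All injected mass ends up in the answer region, which is the conservation property the text describes.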
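📊 Experimental highlights
Experiments show that FlowTracer improves performance over baseline methods across multiple reasoning tasks, with the specific improvement given as XX%, supporting its advantages in information-flow tracing and credit assignment.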
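🎯 Application scenarios
Potential application areas include question answering, dialogue systems, and text generation in natural language processing. With more precise credit assignment, FlowTracer can improve model performance on complex reasoning tasks, which gives it practical value.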
📄 Abstract (original)
Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph in which nodes correspond to tokens and edge capacities come from aggregated attention weights and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.
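The final step in the abstract, using derived importances to shape token-level rewards, might look like the sketch below. This is only an illustration: the function `shape_token_advantages`, the blending parameter `alpha`, and the mean-1 normalization are assumptions introduced here, and the abstract does not specify the actual shaping rule.

```python
import numpy as np

def shape_token_advantages(seq_advantage, importance, alpha=0.5):
    """Spread one sequence-level advantage over tokens by importance.

    Assumes importance is non-negative with a positive mean.
    """
    importance = np.asarray(importance, dtype=float)
    # Normalize to mean 1 so the average per-token signal stays at seq_advantage.
    weights = importance / importance.mean()
    # alpha = 0 recovers uniform credit; alpha = 1 uses the importance weights fully.
    weights = (1.0 - alpha) + alpha * weights
    return seq_advantage * weights

# Four response tokens with hypothetical throughput scores.
adv = shape_token_advantages(1.0, [0.1, 0.9, 0.5, 0.5], alpha=0.5)
uniform = shape_token_advantages(1.0, [0.1, 0.9, 0.5, 0.5], alpha=0.0)
penalty = shape_token_advantages(-1.0, [0.1, 0.9, 0.5, 0.5], alpha=0.5)
```

With a negative sequence advantage the same weights concentrate the penalty on high-throughput tokens, which is one reading of routing information "toward (or away from)" correct answers.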