SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
Authors: Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou
Categories: cs.AI, cs.CL, cs.IR
Published: 2026-05-18
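💡 One-Sentence Summary
SD-Search: search-augmented reasoning based on on-policy hindsight self-distillation
🎯 Matched Area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)
Keywords: search-augmented reasoning, reinforcement learning, self-distillation, hindsight learning, on-policy learning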
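📋 Key Points
- Under outcome-reward RL, existing search-augmented reasoning methods lack fine-grained supervision for individual queries, which makes learning inefficient.
- SD-Search uses on-policy hindsight self-distillation to derive step-level supervision from the policy itself, with no external teacher or extra annotations.
- The method runs inside the standard RL training loop, with no external model inference, annotation pipeline, or additional training stage, so it is easy to integrate.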
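📝 Abstract (Translated from Chinese)
Search-augmented reasoning agents work by interleaving internal reasoning with calls to an external retriever, and their performance depends on the quality of every query they issue. Under outcome-reward reinforcement learning, however, all search decisions in a rollout share one trajectory-level reward, so individual queries receive no step-specific credit. Existing process-supervision methods close this gap with step-level signals that come from outside the policy, relying either on a larger teacher model or on sub-question annotations generated by a stronger external system. This paper proposes SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation and needs neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts for the same question. Because the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making. The student is trained to recover this behavior by minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, the signal is produced by the policy itself inside the standard RL training loop and requires no external model inference, auxiliary annotation pipeline, or additional training stage.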
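🔬 Method Details
Problem definition: When search-augmented reasoning agents are trained with reinforcement learning, the reward is assigned at the trajectory level, so the contribution of each individual query is hard to assess and query quality suffers. Existing methods obtain step-level supervision from external resources, such as a much larger teacher model or sub-question annotations, which adds complexity and cost.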
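The credit-assignment gap can be illustrated with a minimal sketch of group-relative advantages in the style of GRPO. The function names, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one rollout group (GRPO-style).

    Assumption: population std and a small eps; real implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_step_credit(advantage, num_search_steps):
    """Every search step in a rollout inherits the same trajectory-level value."""
    return [advantage] * num_search_steps


# Four rollouts for one question: 1.0 = correct final answer, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A rollout with three search calls: all three receive identical credit,
# regardless of which query actually retrieved the useful evidence.
credit = per_step_credit(advantages[0], num_search_steps=3)
```

In this sketch a helpful query and a wasted query in the same rollout are indistinguishable to the learner, which is the gap that step-level supervision is meant to fill.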
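Core idea: Generate the supervision signal from the policy itself. Through hindsight self-distillation, a single model plays both teacher and student. The teacher additionally conditions on hindsight information (the search queries and final outcomes of a group of rollouts for the same question), so its query distribution reflects which search decisions were worthwhile. The student sees only the context available at inference time and is trained to match the teacher's query distribution, which improves its own queries.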
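The two roles differ only in what they are conditioned on, which a short sketch can make concrete. The `Rollout` type, the `[HINDSIGHT]` tags, and the text layout below are hypothetical; the source describes the hindsight block only as a compact summary of search queries and final outcomes, and does not specify its format or placement.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    queries: List[str]  # search queries issued during the rollout
    correct: bool       # whether the final answer was judged correct


def build_hindsight_block(group: List[Rollout]) -> str:
    """Summarize a rollout group's queries and outcomes (hypothetical format)."""
    lines = ["[HINDSIGHT]"]
    for i, rollout in enumerate(group, start=1):
        outcome = "success" if rollout.correct else "failure"
        joined = " | ".join(rollout.queries) if rollout.queries else "(no search)"
        lines.append(f"rollout {i} ({outcome}): {joined}")
    lines.append("[/HINDSIGHT]")
    return "\n".join(lines)


def student_context(question: str, prefix: str) -> str:
    """Student: only what is available at inference time."""
    return f"{question}\n{prefix}"


def teacher_context(question: str, prefix: str, group: List[Rollout]) -> str:
    """Teacher: the same context, preceded by the hindsight block."""
    return f"{build_hindsight_block(group)}\n{question}\n{prefix}"
```

Both contexts would be scored by the same model weights; only the conditioning text changes.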
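Technical framework: SD-Search runs inside the standard RL training loop. First, a group of rollouts is sampled for each question. Next, a hindsight block is built from the search queries and final outcome of each rollout in the group. The policy acting as teacher is conditioned on the original context plus the hindsight block and produces a query distribution; the same policy acting as student is conditioned on the original context alone and produces its own query distribution. Finally, knowledge is distilled by minimizing the token-level Jensen-Shannon divergence between the teacher and student distributions at search-query positions.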
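The distillation step can be sketched in plain Python as follows. This assumes the standard symmetric Jensen-Shannon divergence with natural logarithms and a simple mean over masked positions; the source does not state the weighting, normalization, or gradient treatment of the teacher, and all function names here are illustrative.

```python
import math


def kl(p, q):
    """KL(p || q) over one token distribution; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence, bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def distillation_loss(student_probs, teacher_probs, query_mask):
    """Mean token-level JSD over positions that belong to a search query.

    student_probs / teacher_probs: one vocabulary distribution per token position.
    query_mask: True where the token is part of a search query.
    """
    terms = [
        js_divergence(t, s)
        for s, t, keep in zip(student_probs, teacher_probs, query_mask)
        if keep
    ]
    return sum(terms) / len(terms) if terms else 0.0


# Toy example with a two-token vocabulary and two positions.
student = [[0.5, 0.5], [0.9, 0.1]]
teacher = [[0.5, 0.5], [0.1, 0.9]]
loss_on_query = distillation_loss(student, teacher, [False, True])
loss_off_query = distillation_loss(student, teacher, [True, False])
```

Positions outside search queries (reasoning text, retrieved passages, final answers) are masked out, so the dense signal applies only where the policy decides what to search for.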
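Key innovation: The central contribution is on-policy hindsight self-distillation, which produces step-level supervision from the policy itself without external resources. Because it needs no external teacher or annotation pipeline, SD-Search is simpler and cheaper to run than prior process-supervision methods and is easy to integrate into existing RL frameworks.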
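Key design: The main design choices are: 1) how the hindsight block is constructed, that is, which search queries and outcomes are selected and how they are summarized; 2) how knowledge is distilled from teacher to student, with the token-level Jensen-Shannon divergence chosen as the loss; 3) how the trajectory-level reward is balanced against the step-level distillation loss.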
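One plausible way to write the combined objective is shown below. This is a sketch under stated assumptions rather than the paper's formula: the weight $\lambda$, the set $\mathcal{Q}$ of search-query token positions, the student context $c_t$, and the hindsight block $h$ are notation introduced here for illustration, and the source does not give the exact form or the value of any coefficient.

```latex
\mathcal{L}(\theta)
  = \mathcal{L}_{\mathrm{GRPO}}(\theta)
  + \lambda \cdot \frac{1}{|\mathcal{Q}|} \sum_{t \in \mathcal{Q}}
    \mathrm{JSD}\!\left( \pi_\theta(\cdot \mid c_t, h) \,\middle\|\, \pi_\theta(\cdot \mid c_t) \right),
\qquad
\mathrm{JSD}(P \,\|\, Q)
  = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\quad M = \tfrac{1}{2}(P + Q).
```

Under this reading, $\lambda = 0$ recovers plain GRPO, and larger values put more weight on the step-level signal relative to the trajectory reward.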
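📊 Experimental Highlights
SD-Search improves the performance of search-augmented reasoning agents through on-policy hindsight self-distillation. According to the reported experiments, the method matches or exceeds existing approaches that rely on external supervision signals, even though it uses no external resources.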
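🎯 Application Scenarios
SD-Search can be applied to tasks that require search-augmented reasoning, such as question answering, knowledge-graph reasoning, and robot navigation. Better queries improve the agent's reasoning and decision-making, which helps it complete tasks in complex environments.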
📄 Abstract (Original)
Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen-Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.