Scaling Multi-Hop Training Data via Graph-Constrained Path Selection

📄 arXiv: 2605.31238v1 📥 PDF

Authors: Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song, Wei Xue, Yike Guo

Categories: cs.CL, cs.LG

Published: 2026-05-29

Comments: 21 pages, 5 figures

🔗 Code/Project: https://github.com/hkgai-official/GCSCS


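💡 One-Sentence Takeaway

Proposes graph-constrained path selection to scale up multi-hop training data.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multi-hop reasoning, graph constraints, language models, data scaling, legal text analysis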

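📋 Key Points

  1. Existing methods degrade sharply on documents built around repetitive templates and densely cross-referencing clauses, conditions the authors say characterize most real-world specialized corpora.
  2. The paper decouples reasoning from verbalization: reasoning paths are enumerated offline over a graph, and the teacher model is invoked only to verbalize paths that have already been validated.
  3. Fine-tuning Qwen3-32B on 80K examples built from the CUAD legal contract corpus raises closed-book Token F1 from 21.66% to 38.58%.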

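📝 Summary

Giving large language models compositional reasoning over specialized documents requires multi-hop training data at scale, and such data rarely exists outside curated benchmarks built on structured sources. To build it from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path and verbalize it as a question-answer pair, but they degrade sharply on documents organized around repetitive templates and densely cross-referencing clauses. This paper decouples the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is called only to verbalize pre-validated paths. Fine-tuning Qwen3-32B on 80K examples constructed this way from the CUAD legal contract corpus improves closed-book Token F1 from 21.66% to 38.58%.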

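🔬 Method Details

Problem definition: Large language models need large-scale multi-hop training data to reason compositionally over specialized documents, and that data is scarce. Existing single-teacher synthesis methods degrade sharply when documents are built around repetitive templates and densely cross-referencing clauses.

Core idea: Separate path enumeration from verbalization. Reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher model is invoked only after a path has been validated, so the teacher no longer has to discover the evidence path itself.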

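Technical framework: The pipeline has two stages. First, reasoning paths are enumerated offline over the graph. Second, the teacher model verbalizes each validated path as a question-answer pair. The graph enforces five geometric admissibility constraints that decide which paths are admissible.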
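The two stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not list the five constraints, so the sketch uses three illustrative ones (a per-hop lower similarity bound `lo`, a per-hop upper bound `hi`, and an endpoint bound `min_endpoint_sim`). The function names `build_graph`, `enumerate_paths`, and `verbalize`, and the `teacher` callable, are all hypothetical.

```python
from itertools import combinations

import numpy as np


def build_graph(centroids, lo, hi):
    """Link keyword centroids whose cosine similarity lies in [lo, hi].

    `lo` and `hi` are illustrative thresholds, not values from the paper.
    """
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = unit @ unit.T  # Gram matrix of pairwise cosine similarities
    adj = {i: [] for i in range(len(unit))}
    for i, j in combinations(range(len(unit)), 2):
        if lo <= sim[i, j] <= hi:
            adj[i].append(j)
            adj[j].append(i)
    return adj, sim


def enumerate_paths(adj, sim, hops, min_endpoint_sim):
    """Enumerate simple paths with `hops` edges, offline, with no teacher calls.

    A path is kept only if its two endpoints are still at least
    `min_endpoint_sim` similar (an assumed guard against endpoint drift).
    """
    paths = []

    def extend(path):
        if len(path) == hops + 1:
            if sim[path[0], path[-1]] >= min_endpoint_sim:
                paths.append(tuple(path))
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # keep paths simple (no revisits)
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return paths


def verbalize(path, keywords, teacher):
    """Call the (hypothetical) teacher only on an already validated path."""
    return teacher(" -> ".join(keywords[i] for i in path))
```

Because enumeration never calls the teacher, its cost in this sketch depends only on the graph, and every teacher call is spent on a path that has already passed the checks.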

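Key innovation: The main finding is a reframing of what the graph constraints do. In a matched-size ablation, constrained and unconstrained chains yield indistinguishable downstream performance at equal training scale. The gain at full scale comes from a 4.4× expansion of the usable corpus rather than from higher per-chain quality. In this setting, the constraints raise teacher synthesizability (how many paths the teacher can successfully verbalize) rather than improving chain content.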

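Key design: The constraints are justified with Gram-matrix arguments. These establish that local (per-hop) similarity bounds alone admit endpoint drift of up to about 91°, and that an upper similarity bound is necessary for a path to exit the dense embedding cliques formed by boilerplate text.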
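The general shape of such a Gram-matrix argument can be written out for three unit vectors. The derivation below is a standard linear-algebra sketch, not the paper's own proof: the threshold $\tau$ and hop count $k$ are placeholder symbols, and the roughly 91° figure depends on the paper's actual thresholds, which this summary does not give.

```latex
% Unit vectors a, b, c with cosine similarities s_{ab}, s_{bc}, s_{ac}.
% Their Gram matrix must be positive semidefinite:
\[
G = \begin{pmatrix} 1 & s_{ab} & s_{ac} \\ s_{ab} & 1 & s_{bc} \\ s_{ac} & s_{bc} & 1 \end{pmatrix} \succeq 0
\;\Longrightarrow\;
\det G = 1 + 2 s_{ab} s_{bc} s_{ac} - s_{ab}^2 - s_{bc}^2 - s_{ac}^2 \ge 0 .
\]

% Solving the quadratic inequality in s_{ac}:
\[
s_{ab} s_{bc} - \sqrt{(1 - s_{ab}^2)(1 - s_{bc}^2)}
\;\le\; s_{ac} \;\le\;
s_{ab} s_{bc} + \sqrt{(1 - s_{ab}^2)(1 - s_{bc}^2)} .
\]

% With s_{ab} = \cos\theta_1 and s_{bc} = \cos\theta_2, angles in [0, \pi]:
\[
\cos(\theta_1 + \theta_2) \;\le\; s_{ac} \;\le\; \cos(\theta_1 - \theta_2) .
\]

% If every hop only satisfies s \ge \tau, a k-hop path bounds the
% endpoint angle no better than
\[
\theta_{\mathrm{end}} \;\le\; \min\bigl(k \arccos \tau,\; \pi\bigr) .
\]
```

Per-hop angles can add up along a path, so a local lower bound on similarity does not by itself keep the endpoints close. That is the kind of drift the endpoint-level constraints are meant to rule out.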

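📊 Experimental Highlights

On the CUAD legal contract corpus, fine-tuning Qwen3-32B on 80K examples built with the new method raises closed-book Token F1 from 21.66% to 38.58%, a gain of 16.92 points (about 78% relative). The matched-size ablation attributes this gain to the 4.4× larger usable corpus rather than to higher per-chain quality.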
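The arithmetic behind the reported gain, together with one common definition of token-level F1, can be sketched as below. The `token_f1` function is an assumption: it uses lowercased whitespace tokens in the style of extractive QA evaluation, and the paper's exact normalization is not stated in this summary.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumes lowercased whitespace tokenization; the paper may normalize
    text differently.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Reported closed-book Token F1 before and after fine-tuning (percent).
baseline, tuned = 21.66, 38.58
absolute_gain = tuned - baseline                 # percentage points
relative_gain = absolute_gain / baseline * 100   # percent of the baseline
print(f"+{absolute_gain:.2f} points, {relative_gain:.1f}% relative")
```

Stating both numbers avoids ambiguity: the absolute gain is in percentage points of F1, while the relative gain is measured against the 21.66% baseline.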

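🎯 Application Scenarios

Potential application areas include legal text analysis, medical literature retrieval, and technical document understanding. The paper evaluates only on legal contracts (CUAD), so the other domains are prospective: building large-scale multi-hop training data could improve domain-specific reasoning in question-answering and information-retrieval systems wherever documents are similarly templated and cross-referenced.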

📄 Abstract (Original)

Endowing large language models with compositional reasoning over specialized documents requires multi-hop training data at scale, where such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question-answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross-referencing clauses, conditions that characterize most real-world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre-validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram-matrix arguments establishing that local similarity bounds alone admit endpoint drift up to ${\sim}91^{\circ}$, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched-size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4$\times$ expansion of the usable corpus rather than from higher per-chain quality -- reframing the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine-tuning Qwen3-32B on 80K examples constructed from the CUAD legal contract corpus improves closed-book Token F1 from 21.66% to 38.58%. We have released our codes at https://github.com/hkgai-official/GCSCS.