Semantic Bridge: Universal Multi-Hop Question Generation via AMR-Driven Graph Synthesis

Authors: Linqing Chen, Hanmeng Zhong, Wentao Wu, Weilei Wang

Category: cs.CL

Published: 2025-08-06

💡 One-Sentence Summary

Proposes the Semantic Bridge framework for controllably generating multi-hop reasoning questions.

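🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multi-hop reasoning, question-answer generation, semantic graph, large language models, controllable generation, domain-specific data, reasoning chains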

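📋 Key Points

Existing methods rely on surface patterns and cannot generate controllable, complex multi-hop reasoning questions, which limits their usefulness for LLM training.
The Semantic Bridge framework uses semantic graph weaving to controllably generate complex reasoning questions from arbitrary sources.
In extensive evaluation across multiple datasets, the generated question pairs outperform baselines in complexity and answerability, with consistent gains of 18.3%-25.4% over baselines across four languages.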

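📝 Abstract (Summary)

Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources such as PubMed papers or legal documents. Existing methods rely on surface patterns and fail to generate controllable, complex multi-hop reasoning questions. The authors present Semantic Bridge, which they describe as the first universal framework for controllably generating complex multi-hop reasoning questions. Its central innovation is semantic graph weaving, in which three complementary bridging mechanisms (entity bridging, predicate chain bridging, and causal bridging) systematically construct complex pathways across documents. The multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, and across multiple datasets the generated question pairs outperform baselines in complexity, answerability, and pattern coverage.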

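🔬 Method Details

Problem definition: The paper addresses the scarcity of high-quality, reasoning-intensive question-answer pairs for LLM training. Existing methods rely on surface patterns and cannot generate complex multi-hop reasoning questions, which limits model understanding and the range of applications.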

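Core idea: The Semantic Bridge framework uses semantic graph weaving, built on three bridging mechanisms (entity bridging, predicate chain bridging, and causal bridging), to systematically construct complex reasoning paths across documents and thereby generate multi-hop reasoning questions controllably.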
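To make entity bridging more concrete, here is a minimal Python sketch of one possible reading of "role-varying shared entities": an entity that appears in two different documents while filling different semantic roles. This is an illustration under stated assumptions, not the authors' implementation. The `Mention` class, the `find_entity_bridges` function, and the toy mentions are all hypothetical, and the AMR-style edges are written by hand rather than produced by an AMR parser.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """Hypothetical AMR-style edge: an entity filling a role of a predicate."""
    doc_id: str
    predicate: str
    role: str      # e.g. "ARG0" (agent-like) or "ARG1" (patient-like)
    entity: str


def find_entity_bridges(mentions):
    """Return mention pairs that share an entity across documents in different roles."""
    by_entity = defaultdict(list)
    for m in mentions:
        by_entity[m.entity].append(m)

    bridges = []
    for group in by_entity.values():
        for a, b in combinations(group, 2):
            # A bridge needs two different documents and two different roles.
            if a.doc_id != b.doc_id and a.role != b.role:
                bridges.append((a, b))
    return bridges


# Hand-written toy mentions (assumption); a real pipeline would parse them from text.
mentions = [
    Mention("doc1", "inhibit-01", "ARG0", "aspirin"),
    Mention("doc1", "inhibit-01", "ARG1", "COX-1"),
    Mention("doc2", "produce-01", "ARG0", "COX-1"),
    Mention("doc2", "produce-01", "ARG1", "thromboxane"),
    Mention("doc3", "inhibit-01", "ARG0", "aspirin"),
]

bridges = find_entity_bridges(mentions)
for a, b in bridges:
    print(f"{a.entity}: {a.role} of {a.predicate} ({a.doc_id}) "
          f"<-> {b.role} of {b.predicate} ({b.doc_id})")
```

In this toy data only "COX-1" qualifies as a bridge, because it is the patient of one predicate in doc1 and the agent of another in doc2. "aspirin" also appears in two documents but in the same role both times, so it is skipped. A two-hop question could then be composed along the bridged path, for example asking which substance's production is affected by aspirin.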

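Technical framework: The overall architecture is a multi-modal AMR pipeline. It first performs semantic analysis, then constructs reasoning paths through the three bridging mechanisms, and finally generates question pairs. Each module is designed to keep the generated questions complex and controllable.

Key innovation: The main technical contribution is the concept of semantic graph weaving, whose three complementary bridging mechanisms systematically address the inability of existing methods to generate complex multi-hop reasoning questions.

Key design: The model design uses a dedicated loss function to optimize the quality of generated questions, and the parameters of the bridging mechanisms can be adjusted for fine-grained control over question complexity and type.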

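📊 Experimental Highlights

Human evaluation shows that the generated question pairs are 23.4% higher in complexity, 18.7% better in answerability, and 31.2% better in pattern coverage. Across four languages (English, Chinese, French, German), the method yields consistent gains of 18.3%-25.4% over baselines. Question pairs generated from 200 sources outperform 600 native human-annotated examples while using 67% fewer materials.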
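As a quick consistency check, the 67% reduction appears to follow from the counts quoted in the original abstract (200 sources versus 600 human-annotated examples), assuming "materials" refers to those counts:

```latex
\[
1 - \frac{200}{600} = \frac{2}{3} \approx 66.7\% \approx 67\%
\]
```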

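🎯 Application Scenarios

Potential application areas include education, law, and medicine, where question generation requires complex reasoning. By supplying high-quality question-answer pairs, the approach can improve LLM training and in turn support the development of intelligent question-answering systems.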

📄 Abstract (Original)

Large language model (LLM) training faces a critical bottleneck: the scarcity of high-quality, reasoning-intensive question-answer pairs, especially from sparse, domain-specific sources like PubMed papers or legal documents. Existing methods rely on surface patterns, fundamentally failing to generate controllable, complex multi-hop reasoning questions that test genuine understanding, which is essential for advancing LLM training paradigms. We present Semantic Bridge, the first universal framework for controllably generating sophisticated multi-hop reasoning questions from arbitrary sources. Our breakthrough innovation is semantic graph weaving: three complementary bridging mechanisms (entity bridging for role-varying shared entities, predicate chain bridging for temporal/causal/logical sequences, and causal bridging for explicit reasoning chains) that systematically construct complex pathways across documents, with fine-grained control over complexity and types via AMR-driven analysis. Our multi-modal AMR pipeline achieves up to 9.5% better round-trip quality, enabling production-ready controllable QA generation. Extensive evaluation demonstrates performance across both general-purpose datasets (Wikipedia) and specialized domains (biomedicine). It yields consistent 18.3%-25.4% gains over baselines across four languages (English, Chinese, French, German). Question pairs generated from 200 sources outperform 600 native human annotation examples with 67% fewer materials. Human evaluation shows 23.4% higher complexity, 18.7% better answerability, and 31.2% improved pattern coverage. Semantic Bridge establishes a new paradigm for LLM training data synthesis, enabling controllable generation of targeted reasoning questions from sparse sources. We will release our core code and semantic bridge model.
