Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion

📄 arXiv: 2510.24390v1

Authors: Xianjun Gao, Jianchun Liu, Hongli Xu, Liusheng Huang

Category: cs.AI

Published: 2025-10-28


💡 One-Sentence Takeaway

Proposes the Orion framework to resolve the tension between efficiency and quality in LLM reasoning.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, reasoning framework, query decomposition, content expansion, dependency relations, real-time Web applications, performance optimization

📋 Key Points

  1. Existing LLM reasoning methods struggle to balance efficiency and quality, constraining Web service performance.
  2. The Orion framework improves both reasoning efficiency and quality through dependency-aware query decomposition and logic-parallel content expansion.
  3. Experiments show that Orion substantially outperforms baselines in generation speed and latency, while also markedly improving reasoning quality.

📝 Abstract (Summary)

Integrating large language models (LLMs) into real-time Web applications, such as AI-powered search and conversational agents, must reconcile high-quality complex reasoning with stringent low-latency, high-throughput requirements. Current LLM reasoning, hindered by computationally inefficient sequential generation and rigid reasoning strategies, has become a bottleneck for Web services. This paper proposes Orion, a novel and efficient reasoning framework that overcomes these limitations through dependency-aware query decomposition and logic-parallel content expansion. Orion decomposes a single query's reasoning process into two synergistic phases: key point generation and content parallel expansion. Experiments show that Orion achieves up to 4.33× faster token generation and 3.42× lower latency across multiple benchmarks, while improving reasoning quality by up to 18.75%.

🔬 Method Details

Problem definition: This paper targets the low reasoning efficiency and limited quality of large language models in real-time Web applications. Existing methods typically optimize for either efficiency or quality, and struggle to achieve both.

Core idea: Orion decomposes the query reasoning process into two phases, key point generation and content parallel expansion, and exploits inter-point dependencies to improve both the logical consistency and the efficiency of reasoning.

Technical framework: Orion's overall architecture comprises two main modules. Phase one, key point generation, distills logically structured key points via retrieval-augmented few-shot prompting; phase two, content parallel expansion, concurrently elaborates these points according to a dependency graph.
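The two-phase flow can be sketched in miniature. Below is a minimal illustration (not the paper's implementation): a hypothetical dependency graph stands in for the output of phase one, `expand_point` stands in for an LLM call, and `graphlib.TopologicalSorter` batches mutually independent points so they can be elaborated concurrently while dependent points wait for their prerequisites.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical output of the key-point generation phase: each key point
# maps to the points it logically depends on (the dependency graph).
dependencies = {
    "background": set(),
    "method": {"background"},
    "analysis": {"background"},
    "conclusion": {"method", "analysis"},
}

def expand_point(point, prereq_texts):
    """Stand-in for an LLM call that elaborates one key point, given the
    already-expanded text of its prerequisites for logical consistency."""
    return f"<{point} expanded after {sorted(prereq_texts)}>"

def parallel_expand(deps):
    expanded = {}
    ts = TopologicalSorter(deps)
    ts.prepare()
    with ThreadPoolExecutor() as pool:
        while ts.is_active():
            batch = list(ts.get_ready())  # points whose prerequisites are done
            # Mutually independent points are expanded concurrently
            # (the "logic-parallel" step).
            for point, text in pool.map(
                lambda p: (p, expand_point(p, {q: expanded[q] for q in deps[p]})),
                batch,
            ):
                expanded[point] = text
                ts.done(point)
    return expanded

expansions = parallel_expand(dependencies)
```

Here "method" and "analysis" share no dependency edge, so they expand in the same batch, while "conclusion" only starts once both are finished, mirroring how the dependency graph enforces logical consistency.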

Key innovation: Orion introduces dependency-aware query decomposition and logic-parallel content expansion, and enables cross-query parallelism across multiple queries, significantly improving reasoning performance.

Key design: Orion employs a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases (generation is compute-bound while expansion is memory-bound), optimizing GPU compute and memory utilization.
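Why pipelining across queries helps can be seen with a toy two-stage latency model (stage durations below are illustrative assumptions, not measurements from the paper): once the first query fills the pipeline, the slower stage alone bounds throughput, so one query's expansion overlaps the next query's generation.

```python
# Toy model of two-stage pipeline scheduling across queries: generation
# (compute-bound) overlaps with expansion (memory-bound), so while query i
# is expanding, query i+1 can already be generating key points.
# Stage durations are hypothetical, chosen only for illustration.

GEN, EXPAND = 1.0, 2.0  # seconds per query in each stage (assumed)

def sequential_latency(n_queries):
    # No overlap: every query runs both stages back to back.
    return n_queries * (GEN + EXPAND)

def pipelined_latency(n_queries):
    # After the first query fills the pipeline, completion rate is
    # limited only by the slower of the two stages.
    return GEN + EXPAND + (n_queries - 1) * max(GEN, EXPAND)

for n in (1, 4, 16):
    print(f"{n:>2} queries: sequential {sequential_latency(n):4.1f}s, "
          f"pipelined {pipelined_latency(n):4.1f}s")
```

With these assumed timings, 16 queries finish in 33 s pipelined versus 48 s sequentially; the gap widens as the batch grows, which is the cross-query parallelism the design targets.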

📊 Experimental Highlights

Across multiple benchmarks, Orion delivers up to a 4.33× increase in token generation speed and a 3.42× reduction in answer latency, while improving reasoning quality by up to 18.75%, demonstrating a clear advantage on complex reasoning tasks.

🎯 Application Scenarios

Orion has broad potential in AI-powered search engines, dialogue systems, and other real-time Web applications. By improving reasoning efficiency and quality, it can deliver faster, more accurate responses, improving the user experience and advancing intelligent services.

📄 Abstract (Original)

The integration of Large Language Models (LLMs) into real-time Web applications, such as AI-powered search and conversational agents, presents a fundamental Web infrastructure challenge: reconciling the demand for high-quality, complex reasoning with the stringent low-latency and high-throughput requirements of interactive services. Current LLM reasoning, hindered by computationally inefficient sequential generation and rigid reasoning strategies, creates a critical bottleneck for the Web services. Existing approaches typically optimize the LLM reasoning for either efficiency or quality but struggle to achieve both, and thus fail to meet the dual requirements of modern Web platforms. To overcome these limitations, we propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Concretely, Orion decomposes a single query reasoning process into two synergistic phases: (1) *key point generation*, which distills logically structured key points through retrieval-augmented few-shot prompting, and (2) *content parallel expansion*, which concurrently elaborates on these points based on a dependency graph to ensure logical consistency. Furthermore, Orion introduces a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases (generation imposes pressure on GPU computing and expansion stresses on GPU memory) across multiple queries, enabling cross-query parallelism and dramatically improving reasoning performance (i.e., efficiency and quality). Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency over the baselines but also improves reasoning quality by up to 18.75% through explicitly modeling inter-point dependencies.