Advancing LLM Reasoning Generalists with Preference Trees

Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun

Categories: cs.AI, cs.CL, cs.LG

Published: 2024-04-02

Comments: Models and data are available at https://github.com/OpenBMB/Eurus

💡 One-Sentence Takeaway

Introduces Eurus, a suite of LLMs fine-tuned to improve performance on complex reasoning tasks.

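🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: large language models, reasoning, preference learning, reward modeling, UltraInteract, complex tasks, open-source models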

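📋 Key Points

Existing large language models still perform poorly on complex reasoning tasks, particularly mathematical and logical reasoning.
The paper introduces the Eurus models together with the UltraInteract dataset, improving reasoning through preference learning and a new reward modeling objective.
Eurus-70B performs strongly across multiple benchmarks and substantially outperforms other open-source models on LeetCode and TheoremQA.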

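📝 Abstract (Summary)

We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Fine-tuned from Mistral-7B and CodeLlama-70B, the Eurus models achieve state-of-the-art results among open-source models on diverse benchmarks covering mathematics, code generation, and logical reasoning. Eurus-70B beats GPT-3.5 Turbo in a comparison across 12 tests covering five tasks, and reaches 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models. The strong performance of Eurus is primarily attributed to UltraInteract, a large-scale, high-quality alignment dataset designed specifically for complex reasoning tasks.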

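🔬 Method Details

Problem definition: The paper addresses the weak performance of existing LLMs on complex reasoning tasks, particularly mathematical and logical reasoning. Existing approaches often fall short on these tasks and lack effective alignment and preference learning mechanisms.

Core idea: Build the Eurus models by fine-tuning on the UltraInteract dataset. UltraInteract contains diverse reasoning chains and interaction trajectories intended to improve performance on complex reasoning tasks.

Technical framework: Eurus starts from the Mistral-7B and CodeLlama-70B base models and applies supervised fine-tuning and preference learning on UltraInteract. For each instruction, UltraInteract organizes the reasoning chains and interaction data into a preference tree.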

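Key innovation: The main contributions are the construction of the UltraInteract dataset and a new reward modeling objective. The paper's investigation finds that some well-established preference learning algorithms may be less suitable for reasoning tasks than for general conversation, which motivates the new objective.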
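This summary does not spell out the new reward modeling objective, so it is not reproduced here. For orientation only, the sketch below shows the standard Bradley-Terry pairwise loss that reward models are commonly trained with. It is a generic baseline, not the objective derived in the paper, and the function name is illustrative.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Standard Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Generic baseline only; NOT the paper's new reward modeling objective.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the reward model scores the chosen response further above the rejected one, and equals log 2 when the two scores are tied.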

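Key design: Training uses multi-turn interaction trajectories (with the environment and the critique) and pairwise data to support preference learning. The loss function and reward model are designed for complex reasoning tasks; specific hyperparameters and architectural details are given in the paper.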
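To make the preference tree idea concrete, here is a minimal sketch of how one instruction's multi-turn trajectory might be stored and flattened into pairwise data. The class names, fields, and the way context is concatenated are all assumptions for illustration; they are not the UltraInteract schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    """One level of a hypothetical preference tree."""
    chosen: str                     # action judged correct at this turn
    rejected: str                   # action judged incorrect at this turn
    feedback: Optional[str] = None  # observation/critique for the rejected action


@dataclass
class PreferenceTree:
    instruction: str
    turns: List[Turn]


def extract_pairs(tree: PreferenceTree) -> List[Tuple[str, str, str]]:
    """Return one (context, chosen, rejected) triple per turn.

    Assumed layout: the context for turn k is the instruction plus the
    failed attempts and feedback from the earlier turns.
    """
    pairs = []
    context = tree.instruction
    for turn in tree.turns:
        pairs.append((context, turn.chosen, turn.rejected))
        # Later turns continue from the incorrect action and its feedback.
        context += "\n" + turn.rejected
        if turn.feedback:
            context += "\n" + turn.feedback
    return pairs


# Toy example with made-up content.
tree = PreferenceTree(
    instruction="Compute 2 + 3 * 4.",
    turns=[
        Turn(
            chosen="3 * 4 = 12, so 2 + 12 = 14.",
            rejected="2 + 3 = 5, so 5 * 4 = 20.",
            feedback="Critique: multiplication binds tighter than addition.",
        ),
        Turn(chosen="Multiply first: 2 + 12 = 14.", rejected="The answer is still 20."),
    ],
)
pairs = extract_pairs(tree)
```

Each triple can feed a pairwise preference learning method, while the chosen actions alone could serve as supervised fine-tuning targets.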

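📊 Experimental Highlights

Eurus-70B beats GPT-3.5 Turbo in reasoning in a benchmark comparison across 12 tests covering five tasks, and achieves 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA. On these two benchmarks it outperforms existing open-source models by margins of more than 13.3%.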
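For readers unfamiliar with the metric, pass@1 with a single sample per problem is simply the fraction of problems whose one generated solution passes. The sketch below illustrates that arithmetic; the function name and the toy results are hypothetical and are not the paper's evaluation harness or data.

```python
from typing import Dict


def pass_at_1(results: Dict[str, bool]) -> float:
    """Fraction of problems whose single sampled solution passed all tests."""
    if not results:
        raise ValueError("results must contain at least one problem")
    return sum(results.values()) / len(results)


# Made-up outcomes for three problems: one pass, two failures.
toy_results = {"problem_a": True, "problem_b": False, "problem_c": False}
score = pass_at_1(toy_results)
```

With one pass out of three problems, the toy score is one third.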

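🎯 Application Scenarios

Potential applications of the Eurus models include education, programming assistance, and scientific research. Stronger performance on complex reasoning tasks could give users more accurate answers and suggestions and improve their productivity. Going forward, Eurus models may support the development of intelligent assistants and automation tools.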

📄 Abstract (Original)

We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each instruction, it includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise data to facilitate preference learning. UltraInteract allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. Inspired by this, we derive a novel reward modeling objective which, together with UltraInteract, leads to a strong reward model.
