ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Authors: Peng Xu, Wei Ping, Xianchao Wu, Chejian Xu, Zihan Liu, Mohammad Shoeybi, Bryan Catanzaro

Categories: cs.CL, cs.AI, cs.IR, cs.LG

Published: 2024-07-19 (updated: 2025-02-14)

Note: Accepted at ICLR 2025

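💡 One-Sentence Takeaway

ChatQA 2: a Llama 3.0-based model that improves long-context understanding and RAG capabilities, rivaling GPT-4-Turbo

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: long-context understanding, retrieval-augmented generation, instruction tuning, context window extension, open-source LLMs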

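📋 Key Points

Open-source LLMs lag behind proprietary models in long-context understanding and RAG, which limits their use on tasks that involve large volumes of information.
ChatQA 2 extends the context window of Llama 3.0 to 128K through continued training, then applies three-stage instruction tuning to improve instruction following, RAG performance, and long-context understanding.
Experiments show that ChatQA 2 outperforms most leading models, including GPT-4-Turbo, on ultra-long-context tasks and on a RAG benchmark.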

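🔬 Method Details

Problem definition: Existing open-source LLMs have limited context windows and model long-range dependencies poorly, so they underperform on long-context understanding and RAG tasks. Proprietary models such as GPT-4-Turbo perform strongly, but their closed-source nature restricts research and application.

Core idea: Extend the context window of Llama 3.0 and combine it with instruction tuning to improve understanding and generation over long inputs. The work also compares direct long-context processing with RAG and analyzes the trade-offs between them, aiming to identify the more effective way to handle long inputs.

Technical framework: The ChatQA 2 work has three parts: 1) Context window extension: continued training extends the context window of Llama3-70B-base from 8K to 128K. 2) Instruction tuning: a three-stage process improves, in turn, instruction following, RAG performance, and long-context understanding. 3) RAG integration: RAG is compared against direct long-context processing, including how the number of top-k chunks affects RAG performance.

Key innovations: 1) Context window extension: a continued training recipe extends the context window of Llama 3.0 to 128K. 2) Three-stage instruction tuning: each stage targets a specific capability, so the model performs better on both long-context understanding and RAG tasks. 3) Comparison of RAG and direct long-context processing: a detailed study of the strengths and weaknesses of the two approaches gives a fuller picture of long-context handling.

Key design: 1) Context window extension: the specific training data and training strategy are not specified in this summary. 2) Instruction tuning: the instruction sets and loss functions used in the three stages are not specified. 3) RAG integration: the retrieval method and index construction are not specified. 4) Evaluation metrics: the metrics used to measure long-context understanding and RAG performance are not specified.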
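Since the context-extension recipe is not spelled out here, the following is only a generic illustration of one widely used ingredient of such recipes: raising the base of rotary position embeddings (RoPE) so that the slowest-rotating dimensions cover longer sequences before continued training. Whether ChatQA 2 does this, and with which values, is an assumption and is not confirmed by this summary. The function names, the head dimension of 128, and both base values are illustrative.

```python
import math


def rope_inverse_frequencies(head_dim, base):
    """Per-pair rotation frequencies: theta_i = base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Wavelength, in tokens, of the slowest-rotating pair (2*pi / min frequency)."""
    return 2.0 * math.pi / min(rope_inverse_frequencies(head_dim, base))


if __name__ == "__main__":
    head_dim = 128  # illustrative head dimension (assumption)
    for base in (5e5, 5e6):  # illustrative RoPE bases, not values from the paper
        tokens = longest_wavelength(head_dim, base)
        print(f"base={base:.0e}: longest wavelength is about {tokens:,.0f} tokens")
```

A larger base lowers every rotation frequency except the first, so positional phases repeat over a longer span. That is why this adjustment is usually paired with continued training on long sequences rather than applied on its own.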

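📊 Experimental Highlights

ChatQA 2 outperforms most existing state-of-the-art models, including GPT-4-Turbo-2024-04-09, Qwen2-72B-Instruct, and Llama3.1-70B-Instruct, on ultra-long tasks beyond 100K tokens and on the RAG benchmark. In RAG tasks, retrieving more chunks improves performance, to the point where RAG outperforms feeding the full long input directly to the same long-context model.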
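To make the RAG side of that comparison concrete, here is a minimal sketch of top-k chunk retrieval: split a long document into chunks, score each chunk against the question, keep the k best, and place them in the prompt in document order. It is not the authors' pipeline. The bag-of-words cosine score is a toy stand-in for a learned embedding retriever, and the function names, chunk size, and prompt template are all hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size):
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bag_of_words(text):
    """Lowercased word counts; a toy stand-in for a learned embedding (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_top_k(question, chunks, k):
    """Return the k highest-scoring chunks, restored to document order."""
    q = bag_of_words(question)
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(q, bag_of_words(chunks[i])),
                    reverse=True)
    return [chunks[i] for i in sorted(ranked[:k])]


def build_prompt(question, chunks, k):
    """Concatenate the retrieved chunks and the question (hypothetical template)."""
    context = "\n\n".join(retrieve_top_k(question, chunks, k))
    return f"{context}\n\nQuestion: {question}\nAnswer:"
```

Raising `k` trades a longer prompt for higher recall of relevant passages, which is the knob the comparison above varies. The direct long-context alternative skips retrieval and places the whole document in the prompt.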

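🎯 Application Scenarios

ChatQA 2 suits scenarios that require processing large amounts of text, such as long-document question answering, legal text analysis, financial report interpretation, and summarization of research papers. Because it is open source, it can help advance long-context processing techniques and serve as a stronger base model for applications in these areas.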

📄 Abstract (Original)

In this work, we introduce ChatQA 2, an Llama 3.0-based model with a 128K context window, designed to bridge the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo-2024-04-09) in long context understanding and retrieval-augmented generation (RAG) capabilities. These two capabilities are complementary to each other and essential for LLMs to process large volumes of information that cannot fit into a single prompt. We present a detailed continued training recipe to extend the context window of Llama3-70B-base from 8K to 128K tokens, along with a three-stage instruction tuning process to enhance the model's instruction-following, RAG performance, and long-context understanding capabilities. Our results demonstrate that the Llama3-ChatQA-2-70B model outperforms most existing state-of-the-art models, including GPT-4-Turbo-2024-04-09, Qwen2-72B-Instruct, and Llama3.1-70B-Instruct, on ultra-long tasks beyond 100K tokens, as well as on the RAG benchmark using only a 4K context window, showing the strong long context capability across varying sequence lengths. We further provide extensive comparisons between direct long-context and RAG solutions using the same state-of-the-art long-context LLMs. Interestingly, we find that the performance of strong long-context LLMs using RAG improves when retrieving a larger number of chunks. With a large set of top-k chunks, RAG consistently outperforms direct long-context solution using the same state-of-the-art long-context models (e.g., Llama3-ChatQA-2-70B and Qwen2-72B-Instruct) on both 32K and 128K benchmarks. We open-source the model weights, training data, and the evaluation setup for the community: https://chatqa2-project.github.io/
