InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

📄 arXiv: 2508.08636v1

Authors: Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen

Category: cs.CL

Published: 2025-08-12

Comments: InternBootcamp Tech Report


💡 One-Sentence Takeaway

Proposes InternBootcamp, an open-source framework of 1000+ verifiable, domain-diverse task environments, to improve LLM reasoning through task scaling.

🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: large language models, reasoning ability, reinforcement learning, task generation, model evaluation, open-source framework, automated verification

📋 Key Points

  1. Existing approaches focus mainly on domain-specific reasoning tasks and cannot cope with the diverse, complex reasoning scenarios found in the real world.
  2. The paper proposes the InternBootcamp framework, which automatically generates diverse tasks and integrates verification modules to strengthen LLM reasoning.
  3. Experiments show that a 32B model trained with InternBootcamp achieves state-of-the-art results on the Bootcamp-EVAL benchmark, substantially improving reasoning performance.

🔬 Method Details

Problem definition: The paper addresses the shortfall of large language models on complex reasoning tasks; existing approaches concentrate on narrow-domain tasks and fail to generalize to diverse real-world scenarios.

Core idea: Build the InternBootcamp framework, which automatically generates diverse task environments and pairs them with verification modules to improve the reasoning ability and adaptability of LLMs.

Technical framework: InternBootcamp comprises a task-generation module, a verification module, and an evaluation module. The task-generation module creates diverse training and test cases, the verification module checks the correctness of model responses, and the evaluation module provides overall performance assessment; an illustrative sketch follows below.
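
To make the three modules concrete, here is a minimal, illustrative Python sketch of one task environment ("bootcamp") with a difficulty-configurable case generator, a prompt renderer, and a rule-based verifier. The class and method names (ArithmeticBootcamp, case_generator, prompt_func, verify) are assumptions for illustration only and do not claim to match the official InternBootcamp API.

```python
import random
import re

class ArithmeticBootcamp:
    """Toy task environment in the spirit of a 'bootcamp': generates cases at a
    chosen difficulty and verifies answers. Illustrative names, not the real API."""

    def __init__(self, difficulty: int = 1):
        # Difficulty controls how many operands appear in each expression.
        self.num_terms = 2 + difficulty

    def case_generator(self) -> dict:
        """Sample one task instance; the ground truth is stored with the case."""
        terms = [random.randint(1, 20) for _ in range(self.num_terms)]
        ops = [random.choice(["+", "-"]) for _ in range(self.num_terms - 1)]
        expr = str(terms[0])
        for op, t in zip(ops, terms[1:]):
            expr += f" {op} {t}"
        # eval is safe here: expr contains only integers and +/- operators.
        return {"expression": expr, "answer": eval(expr)}

    def prompt_func(self, case: dict) -> str:
        """Render the case as a prompt for the model."""
        return (f"Compute {case['expression']}. "
                f"Put the final integer inside [answer]...[/answer].")

    def verify(self, response: str, case: dict) -> bool:
        """Objective, rule-based check of the model response."""
        match = re.search(r"\[answer\](-?\d+)\[/answer\]", response)
        return match is not None and int(match.group(1)) == case["answer"]


if __name__ == "__main__":
    env = ArithmeticBootcamp(difficulty=2)
    case = env.case_generator()
    print(env.prompt_func(case))
    print(env.verify(f"[answer]{case['answer']}[/answer]", case))  # True
```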

Key innovation: The most important innovation is an automated agent workflow, supplemented by manual validation, that rapidly expands the task scope, overcoming the difficulty of manually building such broad task coverage and markedly improving training effectiveness; a skeleton of this workflow is sketched below.
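
As a hedged skeleton of such a draft-then-validate workflow (the paper's actual agent prompts and tooling are not reproduced here), the snippet below stubs the agent's drafting step, smoke-tests the drafted environment by checking that a generated case's own ground truth verifies as correct, and only then hands it off for manual review. All names (draft_bootcamp_code, smoke_test, ParityBootcamp) are hypothetical.

```python
# Hedged skeleton of a "draft, smoke-test, hand off for manual review" loop
# for adding new bootcamps. The drafting step is stubbed with a fixed string;
# in the paper it is performed by an LLM-based agent. Names are hypothetical.

DRAFTED_CODE = '''
import random

class ParityBootcamp:
    def case_generator(self):
        n = random.randint(0, 10**6)
        return {"n": n, "answer": "even" if n % 2 == 0 else "odd"}

    def prompt_func(self, case):
        return f"Is {case['n']} even or odd? Answer with one word."

    def verify(self, response, case):
        return case["answer"] in response.lower()
'''

def draft_bootcamp_code(task_description: str) -> str:
    """Stub for the agent step: a real agent would write this code from the description."""
    return DRAFTED_CODE

def smoke_test(code: str) -> bool:
    """Run the drafted module and check that generation and verification round-trip."""
    namespace: dict = {}
    exec(code, namespace)
    # Find the drafted bootcamp class (any class exposing a case generator).
    cls = next(v for v in namespace.values()
               if isinstance(v, type) and hasattr(v, "case_generator"))
    env = cls()
    case = env.case_generator()
    return env.verify(case["answer"], case)  # the ground truth itself must pass

if __name__ == "__main__":
    code = draft_bootcamp_code("classify integers as even or odd")
    if smoke_test(code):
        print("Draft passed automated checks; queue it for manual validation.")
```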

Key design: The framework supports configurable difficulty levels, so users can generate tasks of varying complexity on demand, and the verification module applies objective criteria to keep performance assessment accurate and reproducible; a usage sketch follows.
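
Reusing the illustrative ArithmeticBootcamp class sketched earlier (it must be in scope), the snippet below sweeps difficulty levels and scores a stand-in solver with the rule-based verifier, showing how configurable difficulty plus objective verification produces a measurable, comparable signal.

```python
import random

def dummy_solver(prompt: str) -> str:
    """Stand-in for an LLM policy: guesses a small integer regardless of the prompt."""
    return f"[answer]{random.randint(-10, 10)}[/answer]"

for difficulty in (1, 3, 5):
    env = ArithmeticBootcamp(difficulty=difficulty)   # from the sketch above
    cases = [env.case_generator() for _ in range(200)]
    passed = sum(env.verify(dummy_solver(env.prompt_func(c)), c) for c in cases)
    print(f"difficulty={difficulty}: pass rate {passed / len(cases):.2%}")
```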

📊 Experimental Highlights

Experiments show that the 32B model trained with InternBootcamp reaches state-of-the-art performance on the Bootcamp-EVAL benchmark, clearly outperforming baseline models. In particular, consistent gains persist as the number of training tasks is scaled over two orders of magnitude, confirming the positive effect of task scaling on reasoning performance.

🎯 Application Scenarios

Potential application areas include education, automated reasoning systems, and intelligent assistants. By providing diverse reasoning tasks, InternBootcamp helps researchers and developers train and evaluate large language models more effectively, advancing AI in complex reasoning scenarios. The framework could become standard infrastructure for LLM research, enabling broader applications and further technical progress.

📄 Abstract (Original)

Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely **task scaling**, over two orders of magnitude, offering a promising route towards capable reasoning generalist.
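
The abstract highlights verification as the basis for RL-based model optimization. The sketch below shows the generic pattern of turning a rule-based verifier's verdict into a scalar reward, with a small hypothetical bonus for well-formed output; it is an assumption-laden illustration, not the paper's actual reward design.

```python
from typing import Callable, Dict

def rollout_reward(response: str,
                   case: Dict,
                   verify: Callable[[str, Dict], bool],
                   format_bonus: float = 0.1) -> float:
    """Binary correctness reward plus a small bonus for a parseable answer tag.
    (Hypothetical shaping; the paper's exact reward may differ.)"""
    well_formed = "[answer]" in response and "[/answer]" in response
    correct = verify(response, case)
    return (1.0 if correct else 0.0) + (format_bonus if well_formed else 0.0)

if __name__ == "__main__":
    # Trivial verifier for a toy case whose ground truth is 7.
    toy_case = {"answer": 7}
    toy_verify = lambda resp, c: f"[answer]{c['answer']}[/answer]" in resp
    print(rollout_reward("[answer]7[/answer]", toy_case, toy_verify))  # 1.1
    print(rollout_reward("I think it's 9", toy_case, toy_verify))      # 0.0
```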