Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning

📄 arXiv: 2504.13914v3

Authors: ByteDance Seed: Jiaze Chen, Tiantian Fan, Xin Liu, Lingjun Liu, Zhiqi Lin, Mingxuan Wang, Chengyi Wang, Xiangpeng Wei, Wenyuan Xu, Yufeng Yuan, Yu Yue, Lin Yan, Qiying Yu, Xiaochen Zuo, Chi Zhang, Ruofei Zhu, Zhecheng An, Zhihao Bai, Yu Bao, Xingyan Bin, Jiangjie Chen, Feng Chen, Hongmin Chen, Riwei Chen, Liangqiang Chen, Zixin Chen, Jinsong Chen, Siyan Chen, Kaiyuan Chen, Zhi Chen, Jin Chen, Jiecao Chen, Jinxin Chi, Weinan Dai, Ning Dai, Jiahui Dai, Shihan Dou, Yantao Du, Zhengyin Du, Jianhui Duan, Chen Dun, Ting-Han Fan, Jiazhan Feng, Junda Feng, Ziyuan Feng, Yuwei Fu, Wenqi Fu, Hanjie Fu, Hao Ge, Hongyi Guo, Mingji Han, Li Han, Wenhao Hao, Xintong Hao, Qianyu He, Jerry He, Feng He, Wen Heng, Zehua Hong, Qi Hou, Liang Hu, Shengding Hu, Nan Hu, Kai Hua, Qi Huang, Ziyue Huang, Hongzhi Huang, Zihao Huang, Ting Huang, Wenhao Huang, Wei Jia, Bin Jia, Xiaoying Jia, Yuhua Jiang, Haobin Jiang, Ziheng Jiang, Kaihua Jiang, Chengquan Jiang, Jianpeng Jiao, Xiaoran Jin, Xing Jin, Xunhao Lai, Zheng Li, Xiang Li, Liyi Li, Hongkai Li, Zheng Li, Shengxian Wan, Ya Wang, Yunshui Li, Chenggang Li, Niuniu Li, Siyu Li, Xi Li, Xiao Li, Aoyan Li, Yuntao Li, Nianning Liang, Xinnian Liang, Haibin Lin, Weijian Lin, Ye Lin, Zhicheng Liu, Guanlin Liu, Guanlin Liu, Chenxiao Liu, Yan Liu, Gaohong Liu, Juncai Liu, Chundian Liu, Deyi Liu, Kaibo Liu, Siyao Liu, Qi Liu, Yongfei Liu, Kang Liu, Gan Liu, Boyi Liu, Rui Long, Weiqiang Lou, Chenwei Lou, Xiang Luo, Yao Luo, Caiping Lv, Heyang Lv, Bole Ma, Qianli Ma, Hongzhi Ma, Yiyuan Ma, Jin Ma, Wenchang Ma, Tingting Ma, Chen Mao, Qiyang Min, Zhe Nan, Guanghan Ning, Jinxiang Ou, Haojie Pan, Renming Pang, Yanghua Peng, Tao Peng, Lihua Qian, Lihua Qian, Mu Qiao, Meng Qu, Cheng Ren, Hongbin Ren, Yong Shan, Wei Shen, Ke Shen, Kai Shen, Guangming Sheng, Jinlong Shi, Wenlei Shi, Guang Shi, Shuai Shuai Cao, Yuxin Song, Zuquan Song, Jing Su, Yifan Sun, Tao Sun, Zewei Sun, Borui Wan, Zihan Wang, Xiaohui Wang, Xi Wang, Shuguang Wang, Jun Wang, Qinlong Wang, Chenyuan Wang, Shuai Wang, Zihan Wang, Changbao Wang, Jiaqiang Wang, Shihang Wang, Xuwu Wang, Zaiyuan Wang, Yuxuan Wang, Wenqi Wang, Taiqing Wang, Chengzhi Wei, Houmin Wei, Ziyun Wei, Shufa Wei, Zheng Wu, Yonghui Wu, Yangjun Wu, Bohong Wu, Shuang Wu, Jingqiao Wu, Ning Wu, Shuangzhi Wu, Jianmin Wu, Chenguang Xi, Fan Xia, Yuqiao Xian, Liang Xiang, Boren Xiang, Bowen Xiao, Zhen Xiao, Xia Xiao, Yongsheng Xiao, Chao Xin, Shulin Xin, Yuwen Xiong, Jingjing Xu, Ziwen Xu, Chenyin Xu, Jiayi Xu, Yifan Xu, Wei Xu, Yufei Xu, Shikun Xu, Shipeng Yan, Shen Yan, Qingping Yang, Xi Yang, Tianhao Yang, Yuehang Yang, Yuan Yang, Ximing Yang, Zeyu Yang, Guang Yang, Yifan Yang, Xuesong Yao, Bairen Yi, Fan Yin, Jianian Yin, Ziqiang Ying, Xiangyu Yu, Hongli Yu, Song Yu, Menghan Yu, Huan Yu, Siyu Yuan, Jun Yuan, Yutao Zeng, Tianyang Zhan, Zheng Zhang, Yun Zhang, Mofan Zhang, Wang Zhang, Ru Zhang, Zhi Zhang, Tianqi Zhang, Xinyi Zhang, Zhexi Zhang, Sijun Zhang, Wenqiang Zhang, Xiangxiang Zhang, Yongtao Zhang, Yuyu Zhang, Ge Zhang, He Zhang, Yue Zhang, Renjie Zheng, Ningxin Zheng, Zhuolin Zheng, Yaowei Zheng, Chen Zheng, Xiaoyun Zhi, Wanjun Zhong, Cheng Zhong, Zheng Zhong, Baoquan Zhong, Xun Zhou, Na Zhou, Huan Zhou, Hang Zhu, Defa Zhu, Wenjia Zhu, Lei Zuo

Category: cs.CL

Published: 2025-04-10 (Updated: 2025-04-29)


💡 One-Sentence Takeaway

Seed1.5-Thinking: advancing a superb reasoning model with reinforcement learning, achieving broader applicability across tasks.

🎯 Matched Area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)

Keywords: Reinforcement Learning, Reasoning Models, Mixture-of-Experts, MoE, Large Language Models, STEM, Code Generation

📋 Key Points

  1. Existing reasoning models underperform on complex tasks because they lack an effective thinking process that could boost performance.
  2. Seed1.5-Thinking uses reinforcement learning to train the model to reason through a thinking phase before responding, improving performance.
  3. Experiments show that Seed1.5-Thinking excels on STEM, coding, and non-reasoning tasks alike, with significantly higher win rates.

🔬 Method Details

Problem Definition: Existing large language models often underperform on complex reasoning tasks because they lack an effective thinking process, especially on tasks that require multi-step reasoning and domain knowledge. Such models typically map inputs directly to outputs with no intermediate deliberation or exploration, which limits their performance.

Core Idea: The core idea of Seed1.5-Thinking is to have the model "think" before producing a final answer. By introducing an explicit thinking process, the model can decompose the problem step by step, explore alternative solutions, and ultimately select the best answer. This "think first" strategy mirrors how humans solve problems and aims to improve both reasoning ability and generalization.

Technical Framework: Seed1.5-Thinking adopts a Mixture-of-Experts (MoE) architecture with 200B total parameters, of which only 20B are activated, enabling efficient inference. The overall pipeline is: 1) take the input question; 2) perform multi-step thinking to generate intermediate reasoning steps; 3) generate the final answer based on that thinking. Reinforcement learning trains the model to think effectively, with the reward signal based on answer correctness.
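The three-step pipeline can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: all function names (`generate_thought`, `generate_answer`, `rollout`) and the stub bodies are hypothetical placeholders standing in for real model calls.

```python
# Sketch of the think-then-answer pipeline: the model first emits an
# intermediate reasoning trace, then an answer conditioned on it, and a
# verifiable reward scores only the final answer.

def generate_thought(question: str) -> str:
    # Stub standing in for the model's multi-step reasoning trace.
    return f"Step-by-step reasoning about: {question}"

def generate_answer(question: str, thought: str) -> str:
    # Stub standing in for the final answer, conditioned on question + thought.
    return "4" if question == "2 + 2 = ?" else "unknown"

def correctness_reward(answer: str, reference: str) -> float:
    # RL reward signal based on answer correctness, as described above.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def rollout(question: str, reference: str) -> tuple[str, str, float]:
    # One RL rollout: think, answer, then score the final answer.
    thought = generate_thought(question)
    answer = generate_answer(question, thought)
    return thought, answer, correctness_reward(answer, reference)
```

Note that only the final answer is scored; the intermediate thought is shaped indirectly, through the policy gradient on rollouts that end in correct answers.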

Key Innovation: The key innovation of Seed1.5-Thinking is combining reinforcement learning with an MoE architecture to train the model to "think". Unlike conventional end-to-end training, this approach encourages the model to explore different reasoning paths and learn to choose the best thinking strategy. Moreover, the relatively small model (20B activated parameters) outperforms much larger models, demonstrating efficient parameter utilization.
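The 20B-activated / 200B-total split follows directly from sparse top-k expert routing: each token passes through only a few experts plus the shared layers. A back-of-the-envelope sketch, with expert counts that are purely illustrative (the paper's actual MoE configuration is not disclosed in this summary):

```python
# Why a sparse MoE activates only a fraction of its parameters:
# each token uses the shared (non-expert) layers plus only top_k of
# num_experts expert blocks.

def activated_fraction(num_experts: int, top_k: int,
                       expert_params: float, shared_params: float) -> float:
    """Fraction of total parameters a single token actually uses
    under top-k expert routing."""
    total = shared_params + num_experts * expert_params
    activated = shared_params + top_k * expert_params
    return activated / total
```

With made-up numbers chosen to match the 10% ratio, e.g. 49 experts of 4B parameters each, 4B shared parameters, and top-4 routing, this gives 20B activated out of 200B total.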

Key Design: In RL training, the design of the reward function is critical; the paper likely uses a sparse reward based on answer correctness, or a more elaborate reward-shaping scheme, to guide the model toward effective thinking. Details of the MoE implementation, such as the number of experts and the routing mechanism, also affect performance. The exact network structure and hyperparameter settings are not covered in this summary and would require consulting the paper.
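To make the sparse-reward idea concrete, here is a generic illustration. The exact reward shaping and RL algorithm used by Seed1.5-Thinking are not specified in this summary; pairing a binary correctness reward with group-normalized advantages is simply a common pattern in RL for reasoning models (e.g., GRPO-style training), shown here only as an assumption-laden example.

```python
import statistics

def sparse_reward(predicted: str, reference: str) -> float:
    # Sparse reward: 1.0 only when the final answer is exactly correct.
    return 1.0 if predicted.strip() == reference.strip() else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    # Normalize rewards within a group of sampled responses to the same
    # prompt, so correct answers get positive advantage and wrong ones
    # negative; an all-equal group yields zero advantage everywhere.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Because the reward is sparse, sampling several responses per prompt and normalizing within the group is one way to extract a useful learning signal even when most rollouts fail.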

📊 Experimental Highlights

Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces, and 77.3 on GPQA, demonstrating excellent reasoning ability in STEM and coding. On non-reasoning tasks, it surpasses DeepSeek R1 by 8% in win rate, indicating broader applicability. The model activates only 20B parameters, far fewer than other large models, reflecting efficient parameter utilization.

🎯 Application Scenarios

Seed1.5-Thinking has broad application prospects, including intelligent customer service, code generation, scientific research, and educational tutoring. Stronger reasoning lets the model better understand user intent, generate high-quality code, tackle complex scientific problems, and provide personalized tutoring. This work could help advance AI adoption across domains and deliver smarter, more efficient tools.

📄 Abstract (Original)

We introduce Seed1.5-Thinking, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed1.5-Thinking achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding. Beyond reasoning tasks, the method demonstrates notable generalization across diverse domains. For instance, it surpasses DeepSeek R1 by 8% in win rate on non-reasoning tasks, indicating its broader applicability. Compared to other state-of-the-art reasoning models, Seed1.5-Thinking is a Mixture-of-Experts (MoE) model with a relatively small size, featuring 20B activated and 200B total parameters. As part of our effort to assess generalized reasoning, we develop two internal benchmarks, BeyondAIME and Codeforces, both of which will be publicly released to support future research. Model trial link: https://www.volcengine.com/experience/ark.