Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding

📄 arXiv: 2507.19427v1

Authors: StepFun: Bin Wang, Bojun Wang, Changyi Wan, Guanzhe Huang, Hanpeng Hu, Haonan Jia, Hao Nie, Mingliang Li, Nuo Chen, Siyu Chen, Song Yuan, Wuxun Xie, Xiaoniu Song, Xing Chen, Xingping Yang, Xuelin Zhang, Yanbo Yu, Yaoyu Wang, Yibo Zhu, Yimin Jiang, Yu Zhou, Yuanwei Lu, Houyi Li, Jingcheng Hu, Ka Man Lo, Ailin Huang, Binxing Jiao, Bo Li, Boyu Chen, Changxin Miao, Chang Lou, Chen Hu, Chen Xu, Chenfeng Yu, Chengyuan Yao, Daokuan Lv, Dapeng Shi, Deshan Sun, Ding Huang, Dingyuan Hu, Dongqing Pang, Enle Liu, Fajie Zhang, Fanqi Wan, Gulin Yan, Han Zhang, Han Zhou, Hanghao Wu, Hangyu Guo, Hanqi Chen, Hanshan Zhang, Hao Wu, Haocheng Zhang, Haolong Yan, Haoran Lv, Haoran Wei, Hebin Zhou, Heng Wang, Heng Wang, Hongxin Li, Hongyu Zhou, Hongyuan Wang, Huiyong Guo, Jia Wang, Jiahao Gong, Jialing Xie, Jian Zhou, Jianjian Sun, Jiaoren Wu, Jiaran Zhang, Jiayu Liu, Jie Cheng, Jie Luo, Jie Yan, Jie Yang, Jieyi Hou, Jinguang Zhang, Jinlan Cao, Jisheng Yin, Junfeng Liu, Junhao Huang, Junzhe Lin, Kaijun Tan, Kaixiang Li, Kang An, Kangheng Lin, Kenkun Liu, Lei Yang, Liang Zhao, Liangyu Chen, Lieyu Shi, Liguo Tan, Lin Lin, Lin Zhang, Lina Chen, Liwen Huang, Liying Shi, Longlong Gu, Mei Chen, Mengqiang Ren, Ming Li, Mingzhe Chen, Na Wang, Nan Wu, Qi Han, Qian Zhao, Qiang Zhang, Qianni Liu, Qiaohui Chen, Qiling Wu, Qinglin He, Qinyuan Tan, Qiufeng Wang, Qiuping Wu, Qiuyan Liang, Quan Sun, Rui Li, Ruihang Miao, Ruosi Wan, Ruyan Guo, Shangwu Zhong, Shaoliang Pang, Shengjie Fan, Shijie Shang, Shilei Jiang, Shiliang Yang, Shiming Hao, Shuli Gao, Siming Huang, Siqi Liu, Tiancheng Cao, Tianhao Cheng, Tianhao Peng, Wang You, Wei Ji, Wen Sun, Wenjin Deng, Wenqing He, Wenzhen Zheng, Xi Chen, Xiangwen Kong, Xianzhen Luo, Xiaobo Yang, Xiaojia Liu, Xiaoxiao Ren, Xin Han, Xin Li, Xin Wu, Xu Zhao, Yanan Wei, Yang Li, Yangguang Li, Yangshijie Xu, Yanming Xu, Yaqiang Shi, Yeqing Shen, Yi Yang, Yifei Yang, Yifeng Gong, Yihan Chen, Yijing Yang, Yinmin Zhang, Yizhuang Zhou, Yuanhao Ding, Yuantao Fan, 
Yuanzhen Yang, Yuchu Luo, Yue Peng, Yufan Lu, Yuhang Deng, Yuhe Yin, Yujie Liu, Yukun Chen, Yuling Zhao, Yun Mou, Yunlong Li, Yunzhou Ju, Yusheng Li, Yuxiang Yang, Yuxiang Zhang, Yuyang Chen, Zejia Weng, Zhe Xie, Zheng Ge, Zheng Gong, Zhenyi Lu, Zhewei Huang, Zhichao Chang, Zhiguo Huang, Zhirui Wang, Zidong Yang, Zili Wang, Ziqi Wang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Xiangyu Zhang

Categories: cs.LG, cs.AI

Published: 2025-07-25


💡 One-sentence takeaway

Step-3: model-system co-design for decoding-cost optimization, yielding a cost-effective large language model

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, decoding optimization, model-system co-design, multi-matrix factorization attention, distributed inference

📋 Key points

  1. Large language models suffer from low hardware efficiency during decoding, especially on long-context reasoning tasks.
  2. Through model-system co-design, Step-3 introduces Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD) to reduce decoding cost.
  3. Experiments show that Step-3 outperforms DeepSeek-V3 in decoding throughput and sets a new Pareto frontier for cost-effectiveness.


🔬 Method details

Problem definition: During decoding, especially on long-context reasoning tasks, existing large language models suffer from low hardware efficiency: the KV cache occupies a large amount of memory and the computation is expensive, making inference slow and costly. Existing methods struggle to reduce decoding cost substantially while preserving model quality.
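To see why the KV cache dominates long-context decoding cost, here is a back-of-the-envelope estimate for standard multi-head attention. The model dimensions below are illustrative placeholders, not Step-3's actual configuration:

```python
# Rough KV-cache size for standard multi-head attention (MHA).
# All dimensions below are illustrative, not Step-3's real config.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values, cached at every layer for every position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# e.g. a hypothetical 61-layer model with 128 KV heads of dim 128,
# a 128K-token context, FP16 cache:
size = kv_cache_bytes(61, 128, 128, 128 * 1024)
print(size / 2**30, "GiB per sequence")  # hundreds of GiB per sequence
```

At these sizes a single long sequence can exceed one GPU's memory, which is why reducing KV-cache footprint (as MFA aims to do) directly lowers decoding cost.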

Core idea: Step-3 optimizes decoding jointly at the algorithm and system levels through model-system co-design. Concretely, the novel Multi-Matrix Factorization Attention (MFA) mechanism reduces KV cache size and computation, while the Attention-FFN Disaggregation (AFD) distributed inference system assigns the two kinds of computation to specialized subsystems, raising hardware utilization.

Technical framework: Step-3 consists of two main components: a model architecture built on Multi-Matrix Factorization Attention (MFA), and a distributed inference system based on Attention-FFN Disaggregation (AFD). MFA reduces computational complexity and KV cache size; AFD places the attention and FFN layers on separate hardware resources so they can be computed in parallel.

Key innovations: Step-3's key innovations are the MFA mechanism and the AFD distributed inference system. MFA uses matrix factorization to substantially reduce KV cache size and computation without losing much attention expressiveness. AFD breaks the conventional serial pattern of LLM inference by distributing the workload across specialized subsystems, improving hardware utilization and inference speed.
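This summary does not spell out MFA's exact factorization, but the general flavor of factorized attention with a shrunken KV cache can be sketched as follows: queries go through a shared low-rank factorization, while a single K/V head is shared by all query heads, cutting cached K/V by a factor of the head count. All shapes and the factorization pattern here are assumptions for illustration, not Step-3's actual MFA:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative dimensions only (not Step-3's real configuration).
d_model, d_low, head_dim, n_heads, seq = 64, 16, 16, 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((seq, d_model))

W_down = rng.standard_normal((d_model, d_low)) * 0.1          # shared low-rank factor
W_up = rng.standard_normal((n_heads, d_low, head_dim)) * 0.1  # per-head factor
W_k = rng.standard_normal((d_model, head_dim)) * 0.1          # one shared K head
W_v = rng.standard_normal((d_model, head_dim)) * 0.1          # one shared V head

# Queries: factorized projection, (n_heads, seq, head_dim)
q = np.einsum('sd,dl,hlk->hsk', x, W_down, W_up)
# Keys/values: only ONE head each is cached, instead of n_heads
k, v = x @ W_k, x @ W_v
attn = softmax(q @ k.T / np.sqrt(head_dim))  # (n_heads, seq, seq)
out = attn @ v                               # (n_heads, seq, head_dim)
print(out.shape)  # (4, 8, 16)
```

In this toy version the cached K/V tensors are `n_heads` times smaller than in standard multi-head attention, while each query head still attends with its own projection.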

Key design: This summary does not cover MFA's exact implementation details, but its core idea is to reduce computation and storage through matrix factorization. The key design question for AFD is how to disaggregate the attention and FFN layers effectively, and how to design a communication scheme that keeps data transfer between the subsystems efficient. The paper notes that Step-3 activates 38B parameters per token, a relatively large activated size, yet achieves efficient decoding through MFA and AFD.
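The disaggregation dataflow can be sketched with a toy producer-consumer pipeline, using Python queues in place of real GPU workers and network links. The worker bodies (a doubling and an increment) are pure stand-ins for the attention and FFN sublayers, and nothing here reflects Step-3's actual communication design:

```python
import queue
import threading

# Toy attention-FFN disaggregation: attention and FFN run on separate
# workers connected by a queue, so each side can be batched and
# provisioned independently. Real AFD overlaps GPU kernels with
# network transfers; this only illustrates the dataflow split.
attn_out, ffn_out = queue.Queue(), queue.Queue()

def attention_worker(tokens):
    for t in tokens:
        attn_out.put(t * 2)   # stand-in for the attention sublayer
    attn_out.put(None)        # end-of-stream marker

def ffn_worker():
    while (t := attn_out.get()) is not None:
        ffn_out.put(t + 1)    # stand-in for the FFN sublayer

threads = [threading.Thread(target=attention_worker, args=([1, 2, 3],)),
           threading.Thread(target=ffn_worker)]
for th in threads:
    th.start()
for th in threads:
    th.join()
results = [ffn_out.get() for _ in range(3)]
print(results)  # [3, 5, 7]
```

The design point this illustrates: once the two sublayers live in separate workers, attention capacity (KV-cache memory bound) and FFN capacity (compute bound) can be scaled independently instead of being locked together on the same device.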


📊 Experimental highlights

On Hopper GPUs, Step-3 achieves a decoding throughput of 4,039 tokens per second per GPU under a 50ms TPOT SLA (4K context, FP8, no MTP), surpassing DeepSeek-V3's 2,324 in the same setup. This is a substantial gain in decoding efficiency and sets a new performance bar for LLM decoding. Moreover, Step-3 reaches a lower decoding cost while activating more parameters per token, validating the effectiveness of hardware-aware model-system co-design.
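The quoted throughput figures imply roughly a 1.74x per-GPU speedup. The per-token cost estimate below additionally assumes a hypothetical $2/hour GPU rental price, which is not a figure from the paper:

```python
# Throughput figures quoted from the paper's abstract
# (Hopper GPUs, 50ms TPOT SLA, 4K context, FP8, no MTP).
step3_tps, dsv3_tps = 4039, 2324
speedup = step3_tps / dsv3_tps
print(f"{speedup:.2f}x")  # 1.74x per-GPU decoding throughput

# Illustrative cost per million output tokens, assuming a
# hypothetical $2/hour GPU price (NOT from the paper):
gpu_usd_per_hour = 2.0
for name, tps in [("Step-3", step3_tps), ("DeepSeek-V3", dsv3_tps)]:
    usd_per_mtok = gpu_usd_per_hour / (tps * 3600) * 1e6
    print(name, round(usd_per_mtok, 3), "USD / 1M tokens")
```

Under that assumed price, higher throughput translates directly into a proportionally lower dollar cost per generated token, which is the "Pareto frontier" framing the paper uses.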

🎯 Application scenarios

Step-3's results apply to any scenario that needs efficient long-context processing, such as customer-service chatbots, document summarization, machine translation, and code generation. By lowering decoding cost, it becomes feasible to deploy and serve larger language models in resource-constrained environments, helping AI technology reach wider use.

📄 Abstract

Large language models (LLMs) face low hardware efficiency during decoding, especially for long-context reasoning tasks. This paper introduces Step-3, a 321B-parameter VLM with hardware-aware model-system co-design optimized for minimizing decoding costs. Step-3 innovates in two key dimensions: (1) A novel Multi-Matrix Factorization Attention (MFA) mechanism that significantly reduces both KV cache size and computation while maintaining high attention expressiveness, and (2) Attention-FFN Disaggregation (AFD), a distributed inference system that decouples attention and Feed-Forward Network (FFN) layers into specialized subsystems. This co-design achieves unprecedented cost efficiency: Step-3 significantly reduces theoretical decoding costs compared with models like DeepSeek-V3 and Qwen3 MoE 235B, with the gains widening at longer context. Step-3 achieves low cost while activating 38B parameters per token (more than DeepSeek-V3 and Qwen3 MoE 235B), demonstrating that hardware-aligned attention arithmetic intensity, MoE sparsity, and AFD are critical to cost-effectiveness. We perform a head-to-head comparison with DeepSeek-V3 in its favorable scenarios. Our implementation on Hopper GPUs achieves a decoding throughput of up to 4,039 tokens per second per GPU under 50ms TPOT SLA (4K context, FP8, no MTP). It is higher than DeepSeek-V3's 2,324 in the same setup and sets a new Pareto frontier for LLM decoding.