Next Forcing: Causal World Modeling with Multi-Chunk Prediction
Authors: Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu
Category: cs.CV
Published: 2026-06-09
Notes: Project page: https://gangweix.github.io/next-forcing/
💡 One-Sentence Takeaway
Proposes Next Forcing to address slow training convergence and slow inference in autoregressive video generation.
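🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: autoregressive video generation, multi-chunk prediction, causal modeling, training acceleration, inference optimization, video denoising, dynamics prediction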
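📋 Key Points
- Existing autoregressive video generation methods converge slowly during training and reach limited accuracy at high frame rates, and their inference is slow.
- The Next Forcing framework uses multi-chunk prediction (MCP) modules to denoise video chunks at multiple future temporal horizons simultaneously, with the modules forming a causal chain.
- At 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and sets new state-of-the-art results on the RoboTwin benchmark.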
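📝 Abstract (summary)
Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing methods converge slowly during training and reach limited accuracy at high frame rates, mainly because training supervision is confined to the current chunk and carries no explicit signal about future dynamics. Inference is also slow because of iterative video denoising. This paper proposes Next Forcing, a multi-chunk prediction (MCP) framework designed to speed up training, improve accuracy, and accelerate inference. Next Forcing adds lightweight auxiliary MCP modules that denoise video chunks at multiple future temporal horizons simultaneously. The modules form a causal chain and use intermediate features of the main model to predict future dynamics. Experiments show that at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and establishes new state-of-the-art results on the RoboTwin benchmark.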
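🔬 Method Details
Problem definition: The paper targets the slow training convergence and slow inference of existing autoregressive video generation methods at high frame rates. In these methods, training supervision is confined to the current video chunk and provides no explicit signal about future dynamics, which limits accuracy.
Core idea: Next Forcing introduces multi-chunk prediction (MCP) modules that denoise video chunks at multiple future temporal horizons simultaneously. The modules form a causal chain, so near-future predictions inform farther-future ones, which gives the main model denser temporal supervision.
Technical framework: The framework consists of a main model and several lightweight MCP modules. During training, the MCP modules run alongside the main model and predict future dynamics from intermediate features fused from multiple layers of the main model, which speeds up convergence and improves accuracy.
Key innovation: The main contribution is the MCP training objective, which has the model predict at multiple temporal horizons through a causal chain. This differs fundamentally from conventional methods, whose supervision covers only the current chunk.
Key design: The MCP modules are kept lightweight so that they improve performance without significantly increasing the computational cost. The loss provides multi-scale temporal supervision to strengthen the model's grasp of future dynamics.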
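The chained, multi-horizon supervision described above can be illustrated with a toy sketch. It is not the authors' implementation: the averaging fusion rule, the scalar per-depth weights, the `aux_weight` value, and all function names are assumptions made for illustration, and the diffusion (noising and denoising) process is omitted entirely. It only shows how each MCP depth consumes the previous depth's output and how each depth is scored against a farther-future chunk.

```python
from typing import List, Sequence

Vector = List[float]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def fuse(features: Sequence[Vector]) -> Vector:
    """Fuse features from several main-model layers by averaging (assumed rule)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def mcp_forward(fused: Vector, depth_weights: Sequence[float]) -> List[Vector]:
    """Run toy MCP modules as a causal chain.

    Module k combines the fused main-model features with the hidden state
    produced by module k-1, so nearer predictions feed farther ones.
    """
    hidden = fused
    predictions = []
    for w in depth_weights:  # one hypothetical scalar weight per depth
        hidden = [w * h + f for h, f in zip(hidden, fused)]
        predictions.append(hidden)
    return predictions


def total_loss(main_pred: Vector, chunks: Sequence[Vector], t: int,
               mcp_preds: Sequence[Vector], aux_weight: float = 0.3) -> float:
    """Main loss on chunk t plus weighted auxiliary losses on chunks t+1, t+2, ..."""
    loss = mse(main_pred, chunks[t])
    for k, pred in enumerate(mcp_preds, start=1):
        if t + k < len(chunks):  # skip horizons that run past the clip
            loss += aux_weight * mse(pred, chunks[t + k])
    return loss
```

Calling `mcp_forward(fuse(layer_features), [0.5, 0.5, 0.5])` yields three predictions, one per horizon, and `total_loss` then adds one auxiliary term for each horizon that still lies inside the clip.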
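📊 Experimental Highlights
At 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and converges 2.3x faster. It also sets new state-of-the-art results on the RoboTwin benchmark, reaching 94.1% and 93.5% on the Clean and Random settings, respectively. At inference, retaining the MCP modules to predict the next chunk in parallel with the current one gives a 2x speedup.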
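For readers unfamiliar with how such figures are usually computed, the sketch below shows the standard definitions of relative improvement and convergence speedup. The function names are illustrative and the input numbers in the usage note are hypothetical. This summary does not state the underlying scores or step counts, nor which metric the 93.1% refers to.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Relative gain of `ours` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (ours - baseline) / baseline * 100.0


def convergence_speedup(baseline_steps: int, our_steps: int) -> float:
    """How many times fewer steps we need to reach the same target score."""
    if our_steps <= 0:
        raise ValueError("our_steps must be positive")
    return baseline_steps / our_steps
```

With hypothetical inputs, `relative_improvement(40.0, 60.0)` gives 50.0, and `convergence_speedup(23000, 10000)` gives 2.3. Note that a relative improvement above 90% means the score nearly doubled compared with the baseline at the same step count; it is not an absolute gain of 90 points.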
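🎯 Application Scenarios
Next Forcing has broad potential in video generation, especially in settings that demand both high frame rates and high accuracy, such as autonomous driving, virtual reality, and game development. By accelerating both training and inference, it offers a more efficient route to real-time video generation.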
📄 Abstract (original)
Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next$^1$, next$^2$, next$^3$ chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.
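The reported 2x inference acceleration follows from emitting two chunks per denoising round instead of one. The sketch below counts sequential rounds under an idealized assumption that the retained MCP module adds no overhead. The function names and the scheduling layout are illustrative, not taken from the paper.

```python
from typing import List


def denoising_rounds(num_chunks: int, chunks_per_round: int = 1) -> int:
    """Sequential denoising rounds needed to generate `num_chunks` chunks."""
    if num_chunks < 0 or chunks_per_round < 1:
        raise ValueError("need num_chunks >= 0 and chunks_per_round >= 1")
    return -(-num_chunks // chunks_per_round)  # ceiling division


def schedule(num_chunks: int, chunks_per_round: int) -> List[List[int]]:
    """Group chunk indices by the round in which they are produced."""
    return [
        list(range(start, min(start + chunks_per_round, num_chunks)))
        for start in range(0, num_chunks, chunks_per_round)
    ]
```

For an 8-chunk rollout, `denoising_rounds(8, 1)` is 8 and `denoising_rounds(8, 2)` is 4, which matches the 2x figure in this idealized count. `schedule(5, 2)` shows that an odd number of chunks leaves a final round with a single chunk, so the realized speedup can be slightly below 2x.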