LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Authors: Jianzong Wu, Hao Lian, Jiongfan Yang, Dachao Hao, Ye Tian, Yunhai Tong, Jingyuan Zhu, Biaolong Chen, Qiaosong Qi, Aixi Zhang, Wanggui He, Mushui Liu, Jinlong Liu, Pipei Huang, Hao Jiang

Categories: cs.CV

Published: 2026-06-04 (updated: 2026-06-05)

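💡 One-sentence summary

Proposes LoomVideo, a compact 5B-parameter unified model, to reduce the computational cost of multimodal video generation and editing.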

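🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal video generation, video editing, Deepstack injection, zero-overhead conditioning, Diffusion Transformer, efficient models, e-commerce, fashion generation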

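📋 Key points

Existing multimodal video generation and editing models typically rely on very large parameter counts (13B or more), which makes them computationally expensive and inefficient.
LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and conditions on the source video through a zero-overhead Scale-and-Add method, which avoids token concatenation and sharply reduces computational cost.
Experiments show state-of-the-art or highly competitive results across benchmarks, with inference at least 5.41x faster than models of similar capabilities.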

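📝 Abstract (translated)

Developing unified video generation and editing models that can interpret interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation doubles the sequence length, which quadruples the computational complexity of self-attention and introduces prohibitive overhead. To address these bottlenecks, we present LoomVideo, an efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and uses a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling the clean source video latent and adding it directly to the noised target latent, this design eliminates the need for token concatenation and drastically reduces computational cost while retaining robust capabilities for complex, non-rigid edits. A Negative Temporal RoPE strategy is also integrated to handle multiple reference images. Extensive experiments show that the compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particular strength in e-commerce and fashion generation scenarios, and that it runs inference at least 5.41x faster than models of similar capabilities.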
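
The cost claim about concatenation can be checked with simple arithmetic. The sketch below counts query-key pairs in full self-attention and compares a target-only sequence with one that has an equally long source sequence concatenated to it. The function names and the token count are illustrative assumptions, not values from the paper.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def concat_overhead(target_tokens: int, source_tokens: int) -> float:
    """Ratio of attention pairs with source tokens concatenated vs. target only."""
    total_tokens = target_tokens + source_tokens
    return attention_pair_count(total_tokens) / attention_pair_count(target_tokens)


# Hypothetical token count, chosen only for illustration.
n = 10_000
print(concat_overhead(n, n))  # 4.0 (sequence doubled, pair count quadrupled)
print(concat_overhead(n, 0))  # 1.0 (no extra tokens, no extra attention cost)
```

Only the attention term is counted here; per-token costs such as the feed-forward layers grow linearly with sequence length, so they double rather than quadruple under concatenation.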

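🔬 Method details

Problem definition: The paper targets the computational cost and inefficiency of existing multimodal video generation and editing models, in particular the heavy overhead of large-scale models.

Core idea: LoomVideo replaces the standard text encoder with an MLLM and adopts zero-overhead Scale-and-Add conditioning, avoiding the computational burden that sequence-token concatenation imposes on conventional approaches.

Technical framework: The overall architecture consists of a Multimodal Large Language Model (MLLM) and a Diffusion Transformer (DiT). A Deepstack injection mechanism aligns multi-layer MLLM features with the DiT, so that conditioning information from the MLLM is passed to the generator.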
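
This summary does not specify how multi-layer MLLM features are routed into the DiT. As a purely hypothetical illustration, the sketch below spreads a few tapped MLLM layers evenly across DiT blocks, so that each block knows which layer's features it would receive. The function name, the layer indices, and the even-spread rule are all assumptions and may differ from the authors' Deepstack design.

```python
from typing import List


def assign_mllm_layers(num_dit_blocks: int, tapped_layers: List[int]) -> List[int]:
    """Map each DiT block index to one tapped MLLM layer, spread evenly.

    Hypothetical routing rule: earlier DiT blocks read shallower MLLM
    layers and later blocks read deeper ones.
    """
    if num_dit_blocks <= 0 or not tapped_layers:
        raise ValueError("need at least one DiT block and one tapped layer")
    n = len(tapped_layers)
    return [
        tapped_layers[min(block * n // num_dit_blocks, n - 1)]
        for block in range(num_dit_blocks)
    ]


# Illustrative sizes only: 6 DiT blocks, features tapped at MLLM layers 8, 16, 24.
print(assign_mllm_layers(6, [8, 16, 24]))  # [8, 8, 16, 16, 24, 24]
```

In a real model each routed feature would also pass through a learned projection before reaching its DiT block; that step is omitted because the summary gives no details about it.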

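Key innovation: The core contribution is the zero-overhead Scale-and-Add conditioning method for video editing: the clean source video latent is scaled and added directly to the noised target latent. This removes the need for token concatenation, so the sequence length stays unchanged and the computational cost drops substantially, which fundamentally distinguishes it from concatenation-based methods.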
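
A minimal sketch of the Scale-and-Add idea, assuming the source and target latents have the same shape and that a single scalar controls the scaling. Latents are flattened to plain lists for readability, and the names `scale_and_add`, `concat_condition`, and the scale value 0.5 are hypothetical; the paper's actual scale and tensor layout are not given in this summary.

```python
from typing import List


def scale_and_add(noised_target: List[float], clean_source: List[float],
                  scale: float) -> List[float]:
    """Fuse the source condition into the target without adding tokens."""
    if len(noised_target) != len(clean_source):
        raise ValueError("source and target latents must have the same length")
    return [t + scale * s for t, s in zip(noised_target, clean_source)]


def concat_condition(noised_target: List[float],
                     clean_source: List[float]) -> List[float]:
    """Baseline for comparison: append source tokens to the target sequence."""
    return noised_target + clean_source


# Toy values standing in for latent entries (assumed, not from the paper).
noised_target = [1.0, 2.0, 3.0, 4.0]
clean_source = [10.0, 20.0, 30.0, 40.0]

fused = scale_and_add(noised_target, clean_source, scale=0.5)
print(fused)  # [6.0, 12.0, 18.0, 24.0]
print(len(fused), len(concat_condition(noised_target, clean_source)))  # 4 8
```

The fused input keeps the target's length, so the DiT processes the same number of tokens as in plain generation, whereas the concatenation baseline doubles it.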

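Key design: The model is a compact 5B-parameter architecture. It combines Deepstack injection for MLLM features with a Negative Temporal RoPE strategy that handles multiple reference images, and it retains robust performance on complex, non-rigid edits.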
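
The summary names a Negative Temporal RoPE strategy without giving details. One plausible reading, assumed here, is that reference images receive negative temporal positions while video frames keep positions 0, 1, 2, and so on, so references never collide with frame indices. The sketch below builds such positions and applies a standard two-dimensional rotary rotation, which is well defined for negative positions. The index layout, the function names, and the frequency value are assumptions.

```python
import math
from typing import List, Tuple

Pair = Tuple[float, float]


def temporal_positions(num_frames: int, num_refs: int) -> Tuple[List[int], List[int]]:
    """Return (reference positions, frame positions) along the time axis."""
    frame_pos = list(range(num_frames))            # 0, 1, ..., num_frames - 1
    ref_pos = [-(i + 1) for i in range(num_refs)]  # -1, -2, ... (assumed layout)
    return ref_pos, frame_pos


def rope_rotate(pair: Pair, position: int, freq: float) -> Pair:
    """Rotate one (x, y) feature pair by the angle position * freq."""
    angle = position * freq
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def dot(a: Pair, b: Pair) -> float:
    return a[0] * b[0] + a[1] * b[1]


ref_pos, frame_pos = temporal_positions(num_frames=4, num_refs=2)
print(ref_pos, frame_pos)  # [-1, -2] [0, 1, 2, 3]

# Attention score between a query at frame 2 and a key at reference -1.
q = rope_rotate((1.0, 0.5), frame_pos[2], freq=0.1)
k = rope_rotate((0.3, -0.7), ref_pos[0], freq=0.1)
print(dot(q, k))
```

Because rotary scores depend only on the difference between positions, a reference at -1 is treated as lying just before the first frame rather than overlapping any frame.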

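📊 Experimental highlights

LoomVideo achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particularly strong results in e-commerce and fashion generation scenarios. Its inference is at least 5.41x faster than models of similar capabilities, a substantial efficiency advantage for practical use.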

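🎯 Application scenarios

LoomVideo has broad application potential in e-commerce, fashion generation, and related areas. Its efficient video generation and editing could help merchants produce high-quality product videos quickly and improve the user experience. It could also extend to other creative industries and support more automated, intelligent video content creation.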

📄 Abstract (original)

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier field. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this elegant design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is seamlessly integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, exhibiting exceptional superiority in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
