DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Authors: Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu

Category: cs.CV

Published: 2024-05-21

💡 One-sentence takeaway

DisenStudio: a multi-subject text-to-video generation framework with disentangled spatial control

🎯 Matched areas: Pillar 4: Generative Motion; Pillar 7: Motion Retargeting; Pillar 8: Physics-based Animation

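Keywords: text-to-video generation, multi-subject generation, spatial disentanglement, cross-attention, motion preservation, diffusion models, customized generation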

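📋 Key points

Existing text-to-video methods focus mainly on single-subject customization and suffer from subject-missing and attribute-binding problems in multi-subject scenes.
DisenStudio generates multi-subject videos and controls each subject's action through a spatial-disentangled cross-attention mechanism and motion-preserved disentangled finetuning.
Experiments show that DisenStudio significantly outperforms existing methods on multi-subject video generation and can be applied to a range of controllable generation scenarios.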

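📝 Abstract (translated)

This paper proposes DisenStudio, a framework that generates text-guided videos for multiple customized subjects, given only a few images of each subject. DisenStudio augments a pretrained diffusion-based text-to-video model with a spatial-disentangled cross-attention mechanism that associates each subject with its desired action. The model is then customized for the subjects through motion-preserved disentangled finetuning, which comprises three strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two ensure that every subject appears and keeps its visual attributes; the third helps the model retain its temporal motion-generation ability while being finetuned on static images. Extensive experiments show that DisenStudio significantly outperforms existing methods on a range of metrics, and that it can serve as a tool for various controllable generation applications.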

🔬 Method details

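Problem definition: Existing text-to-video methods face three problems in customized multi-subject generation: subject missing, attribute binding, and action binding. Concretely, the model struggles to make every specified subject appear in the video, to preserve each subject's distinctive visual features, and to assign a given action to the intended subject.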

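Core idea: DisenStudio achieves multi-subject video generation through disentangled spatial control. A spatial-disentangled cross-attention mechanism associates each subject with its desired action, which addresses the action-binding problem. A motion-preserved disentangled finetuning strategy then preserves each subject's visual attributes while maintaining the model's temporal motion-generation ability.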
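To make the idea of spatially disentangled cross-attention more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: it assumes each subject is given a boolean region mask over spatial positions and its own text embedding, and that positions inside a subject's region attend only to that subject's tokens while the rest attend to the full prompt. The function names, the mask-based composition, and the omission of learned query/key/value projections are all choices made for this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, text_tokens):
    """Single-head cross-attention without learned projections.

    queries:     (P, d) one feature vector per spatial position
    text_tokens: (T, d) text embeddings, used as both keys and values
    """
    d = queries.shape[-1]
    weights = softmax(queries @ text_tokens.T / np.sqrt(d))
    return weights @ text_tokens


def spatially_disentangled_attention(queries, global_tokens,
                                     subject_tokens, subject_masks):
    """Hypothetical composition: each subject region attends only to
    that subject's own text tokens; everything else uses the full prompt.

    subject_tokens: list of (T_i, d) arrays, one per subject
    subject_masks:  list of (P,) boolean arrays, one per subject
    """
    # Background positions attend to the embedding of the whole prompt.
    out = cross_attention(queries, global_tokens)
    for tokens, mask in zip(subject_tokens, subject_masks):
        # Overwrite the subject's region with attention over its own phrase,
        # so another subject's action words cannot leak into this region.
        out[mask] = cross_attention(queries[mask], tokens)
    return out
```

In this sketch, changing the text for one subject leaves the output in every other subject's region untouched, which is the property that action binding relies on.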

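Framework: DisenStudio builds on a pretrained diffusion-based text-to-video model and has two key components: the spatial-disentangled cross-attention mechanism and motion-preserved disentangled finetuning. The cross-attention mechanism associates each subject with its desired action. The finetuning comprises three strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning.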

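Key innovations: DisenStudio contributes the spatial-disentangled cross-attention mechanism and the motion-preserved disentangled finetuning strategy. The former lets the model control each subject's action independently; the latter ensures that finetuning on static images does not erase the model's original temporal motion-generation ability. This differs from existing methods that finetune directly on multi-subject data, and it avoids catastrophic forgetting of motion.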

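Key design: Motion-preserved disentangled finetuning comprises three strategies. Multi-subject co-occurrence tuning teaches the model to generate videos that contain all specified subjects. Masked single-subject tuning masks out the other subjects so that the model focuses on learning each subject's distinctive visual features. Multi-subject motion-preserved tuning introduces a motion loss so that the model keeps its original temporal motion-generation ability while being finetuned on static images. The loss functions and parameter settings are described in detail in the paper.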
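As a rough illustration of what masking out the other subjects could mean for a training objective, the sketch below restricts a standard denoising mean-squared error to one subject's region. This is an assumption made for illustration only: the summary does not give the paper's loss, and the function name `masked_denoising_loss`, the `(C, H, W)` layout, and the normalization are hypothetical.

```python
import numpy as np


def masked_denoising_loss(pred_noise, true_noise, subject_mask):
    """Mean squared error computed only inside one subject's region.

    pred_noise, true_noise: (C, H, W) arrays (predicted vs. target noise)
    subject_mask:           (H, W) boolean array, True inside the subject
    """
    mask = subject_mask.astype(pred_noise.dtype)
    sq_err = (pred_noise - true_noise) ** 2
    # The (H, W) mask broadcasts across the channel axis, so errors
    # outside the subject's region contribute nothing to the loss.
    total = (sq_err * mask).sum()
    count = mask.sum() * pred_noise.shape[0]
    # Guard against an empty mask to avoid dividing by zero.
    return total / max(count, 1.0)
```

Under this sketch, errors in regions occupied by other subjects or by the background are ignored, so the gradient for a given subject's learnable parameters would come only from that subject's pixels.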

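📊 Experimental highlights

Experiments show that DisenStudio significantly outperforms existing methods on multi-subject video generation, with gains in subject consistency, action accuracy, and video quality. Compared with the baselines, it produces multi-subject videos that are more realistic and closer to what the user asked for. The performance figures and comparisons are reported in the paper.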

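🎯 Application scenarios

DisenStudio has a broad range of potential applications. In personalized video creation, users can request videos that feature specific people and actions. In virtual character generation, it can create characters with a given appearance and behavior for games or animation. In education, it can produce instructional videos featuring multiple interacting characters. The work helps advance video generation and gives users more flexible, more controllable creation tools.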

📄 Abstract (original)

Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for single subject, suffering from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle the problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model is customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee the subject occurrence and preserve their visual attributes, and the third strategy helps the model maintain the temporal motion-generation ability when finetuning on static images. We conduct extensive experiments to demonstrate our proposed DisenStudio significantly outperforms existing methods in various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.
