Controllable Longer Image Animation with Diffusion Models

Authors: Qiang Wang, Minghua Liu, Junjun Hu, Fan Jiang, Mu Xu

Category: cs.CV

Published: 2024-05-27 (updated: 2024-05-28)

Comments: https://wangqiang9.github.io/Controllable.github.io/

🔗 Code/Project: PROJECT_PAGE

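💡 One-Sentence Takeaway

Proposes a diffusion-based image animation method that enables controllable generation of longer videos.

🎯 Matched Areas: Pillar 7: Motion Retargeting; Pillar 8: Physics-based Animation

Keywords: image animation, video diffusion models, motion priors, long-duration video generation, noise rescheduling, controllable generation, video generation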

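📋 Key Points

Existing image animation methods are limited in handling complex environments and physical dynamics, and struggle outside specific textures and motion trajectories.
The method uses a video diffusion model with motion priors to precisely control the direction and speed of motion in the animated image.
With a noise rescheduling strategy, the method generates long videos of more than 100 frames while keeping content and motion consistent.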

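📝 Abstract (translated from Chinese)

This paper proposes an open-domain controllable image animation method built on video diffusion models. It uses motion priors to precisely control the direction and speed of motion in the movable region, by extracting motion field information from videos and learning motion trajectories and strengths. Existing pretrained video generation models can usually produce only short videos (fewer than 30 frames). To address this, the paper proposes an efficient long-duration video generation method based on noise rescheduling and tailored to image animation, which can generate videos of more than 100 frames while keeping the scene content and motion coordination consistent. Specifically, the denoising process is decomposed into two distinct phases: shaping the scene contours and refining the motion details. The noise is then rescheduled to control the generated frame sequence and preserve long-distance noise correlation. Extensive experiments show that the method outperforms 10 baselines, including both commercial tools and academic methods.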

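🔬 Method Details

Problem definition: Existing image animation methods based on physical simulation and motion prediction struggle with complex environments and physical dynamics, and are usually limited to specific object textures and motion trajectories. Pretrained video generation models produce videos of limited length, which falls short of what long-duration animation requires.

Core idea: Use a video diffusion model to learn motion priors, which enables precise control over the image animation. A noise rescheduling strategy extends the time span of the generated video while keeping content and motion consistent. The denoising process is decomposed into two phases that are handled separately: shaping the scene contours and refining the motion details.

Technical framework: The method consists of a motion information extraction module, a video diffusion model, and a noise rescheduling module. First, motion field information is extracted from videos to learn motion trajectories and strengths. Next, the video diffusion model generates the animated video. Finally, the noise rescheduling strategy controls the generated frame sequence and preserves long-distance noise correlation, which yields a long-duration video.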
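To make the motion extraction step concrete, here is a minimal sketch of how a direction and a strength could be summarized from a motion field. It assumes a dense optical-flow array has already been computed by some external estimator. The function name `summarize_motion`, the optional mask, and the choice of mean vector and mean magnitude are illustrative assumptions, not the extractor used in the paper.

```python
import numpy as np

def summarize_motion(flow, mask=None):
    """Summarize a dense motion field (hypothetical helper, not from the paper).

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    mask: optional (H, W) boolean array marking the movable region.
    Returns (direction, strength): a unit vector (or zeros if there is no
    net motion) and the mean displacement magnitude inside the mask.
    """
    flow = np.asarray(flow, dtype=np.float64)
    if mask is None:
        mask = np.ones(flow.shape[:2], dtype=bool)
    vectors = flow[mask]                      # (N, 2) displacements in the region
    if vectors.shape[0] == 0:
        return np.zeros(2), 0.0
    # Strength: average per-pixel speed in the movable region.
    strength = float(np.linalg.norm(vectors, axis=1).mean())
    # Direction: normalized average displacement.
    mean_vec = vectors.mean(axis=0)
    norm = np.linalg.norm(mean_vec)
    direction = mean_vec / norm if norm > 1e-8 else np.zeros(2)
    return direction, strength
```

A direction and strength pair of this kind is one plausible form for the control signal that conditions the diffusion model; the paper may encode motion differently.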

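Key innovations: 1) motion priors are used to precisely control the direction and speed of motion in the animated image; 2) a long-duration video generation method based on noise rescheduling overcomes the length limit of existing video generation models; 3) the denoising process is decomposed into two phases that handle scene contours and motion details separately.

Key design: The noise rescheduling strategy is one of the key designs. By adjusting how noise is added and removed, it controls the generated frame sequence and preserves long-distance noise correlation. Specific parameter settings and network architecture details are not explicitly given in the paper and remain unknown here.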
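Since the exact schedule is not given, the following is only a sketch of what "long-distance noise correlation" can mean in practice. It builds initial noise for a long clip by mixing a shared base window with fresh per-frame noise, so frames that are `window` apart remain correlated while each frame keeps unit variance. This mixing construction, the function name `rescheduled_noise`, and the parameters `window` and `mix` are assumptions chosen for illustration; they are not the paper's rescheduling procedure.

```python
import numpy as np

def rescheduled_noise(num_frames, window, frame_shape, mix=0.8, seed=0):
    """Correlated Gaussian noise for a long clip (illustrative assumption).

    Frame i reuses the base noise of frame i % window, blended with fresh
    noise. With weights sqrt(mix) and sqrt(1 - mix), every frame stays
    unit-variance and frames `window` apart have correlation `mix`.
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must be in [0, 1]")
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((window, *frame_shape))       # shared component
    fresh = rng.standard_normal((num_frames, *frame_shape))  # per-frame component
    idx = np.arange(num_frames) % window                     # tile the base window
    return np.sqrt(mix) * base[idx] + np.sqrt(1.0 - mix) * fresh

# Example: noise for 120 frames built from a 16-frame base window.
noise = rescheduled_noise(num_frames=120, window=16, frame_shape=(4, 32, 32))
```

Setting `mix=0` gives fully independent frames and `mix=1` repeats the base window exactly, so the parameter trades temporal variety against long-range consistency.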

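📊 Experimental Highlights

Experiments show that the method outperforms 10 baselines for image animation, including both commercial tools and academic methods. It generates long videos of more than 100 frames while keeping scene content and motion coordination consistent, well beyond the roughly 30-frame limit of typical pretrained video generation models.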

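🎯 Application Scenarios

The results can be applied to film production, game development, advertising design, and similar fields, generating high-quality, controllable animated videos from static images and lowering the cost and barrier to entry of animation production. In the future, the technique may extend to virtual reality and augmented reality, offering users more immersive experiences.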

📄 Abstract (Original)

Generating realistic animated videos from static images is an important area of research in computer vision. Methods based on physical simulation and motion prediction have achieved notable advances, but they are often limited to specific object textures and motion trajectories, failing to exhibit highly complex environments and physical dynamics. In this paper, we introduce an open-domain controllable image animation method using motion priors with video diffusion models. Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos and learning moving trajectories and strengths. Current pretrained video generation models are typically limited to producing very short videos, typically less than 30 frames. In contrast, we propose an efficient long-duration video generation method based on noise reschedule specifically tailored for image animation tasks, facilitating the creation of videos over 100 frames in length while maintaining consistency in content scenery and motion coordination. Specifically, we decompose the denoise process into two distinct phases: the shaping of scene contours and the refining of motion details. Then we reschedule the noise to control the generated frame sequences maintaining long-distance noise correlation. We conducted extensive experiments with 10 baselines, encompassing both commercial tools and academic methodologies, which demonstrate the superiority of our method. Our project page: https://wangqiang9.github.io/Controllable.github.io/
