PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

Authors: Chaonan Ji, Jinwei Qi, Sheng Xu, Peng Zhang, Bang Zhang

Category: cs.CV

Published: 2026-04-21

Comments: Accepted to CVPR 2026

💡 One-Sentence Summary

PortraitDirector proposes a hierarchical disentanglement framework for controllable, real-time facial reenactment.

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 4: Generative Motion

Keywords: facial reenactment, hierarchical disentanglement, real-time rendering, controllable generation, diffusion distillation

📋 Key Points

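Existing facial reenactment methods trade expressiveness against fine-grained control and struggle to deliver both.
PortraitDirector decomposes facial motion into a spatial layer (physical movement) and a semantic layer (emotional content), enabling hierarchical disentanglement and composition.
With optimizations including diffusion distillation, causal attention, and VAE acceleration, PortraitDirector achieves real-time, high-fidelity, controllable facial reenactment.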

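📝 Abstract (Translated from the Chinese)

Existing facial reenactment methods struggle to balance expressiveness with fine-grained controllability. Holistic reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. This paper proposes PortraitDirector, a new framework that formulates facial reenactment as a hierarchical composition task to achieve high-fidelity, controllable results. It adopts a hierarchical motion disentanglement and composition strategy that decomposes facial motion into a spatial layer for physical movement and a semantic layer for emotional content. The spatial layer comprises (i) global head pose, managed through a dedicated representation and injection pathway, and (ii) spatially separated local facial expressions, extracted from cropped facial regions and stripped of emotional cues by an Emotion-Filtering Module that leverages an information bottleneck. The semantic layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. In addition, the framework is engineered for real-time performance through a set of optimizations including diffusion distillation, causal attention, and VAE acceleration. On a single 5090 GPU, PortraitDirector achieves streaming, high-fidelity, controllable 512 x 512 facial reenactment at 20 FPS with an end-to-end latency of 800 ms.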

🔬 Method Details

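Problem definition: Existing facial reenactment methods either make fine-grained control difficult or fall short on fidelity and robustness. Holistic methods lack control, while methods designed for control struggle to stay realistic. A method is therefore needed that offers fine control while preserving high fidelity.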

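Core idea: PortraitDirector decomposes facial motion into a spatial layer and a semantic layer, which control physical movement and emotional content respectively. Because the components are disentangled, head pose, local expressions, and global emotion can each be controlled independently, which gives finer control. This hierarchical disentanglement borrows from the compositional approaches common in image synthesis and applies them to facial reenactment.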

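Technical framework: PortraitDirector comprises the following main modules: 1) a head-pose representation and injection module that controls global head pose; 2) an Emotion-Filtering Module that removes emotional information from local facial expressions; 3) a local expression extraction module that extracts spatially separated local facial expressions; 4) a global emotion extraction module that extracts the global emotion; and 5) a motion-latent recomposition module that recombines the disentangled components into an expressive motion latent. Diffusion distillation, causal attention, and VAE acceleration are additionally applied to reach real-time performance.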
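The two-layer decomposition can be pictured as a small data structure. The following is a minimal sketch for illustration only: the class names, vector sizes, region names, and the use of plain concatenation in `recompose` are all assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SpatialLayer:
    """Physical movement: global head pose plus per-region local expressions."""
    head_pose: List[float]                     # hypothetical pose vector
    local_expressions: Dict[str, List[float]]  # hypothetical region -> emotion-filtered code


@dataclass
class SemanticLayer:
    """Emotional content: one derived global emotion code."""
    global_emotion: List[float]                # hypothetical emotion embedding


def recompose(spatial: SpatialLayer, semantic: SemanticLayer) -> List[float]:
    """Recombine the disentangled components into one motion latent.

    Plain concatenation is an assumption made for illustration only.
    """
    latent = list(spatial.head_pose)
    for region in sorted(spatial.local_expressions):  # fixed order for determinism
        latent.extend(spatial.local_expressions[region])
    latent.extend(semantic.global_emotion)
    return latent


# Each component can be swapped independently before recomposition.
driving = SpatialLayer(
    head_pose=[0.1, -0.2, 0.0],
    local_expressions={"eyes": [0.5, 0.4], "mouth": [0.9, 0.1]},
)
neutral = SemanticLayer(global_emotion=[0.0, 0.0])
happy = SemanticLayer(global_emotion=[1.0, 0.3])

latent_neutral = recompose(driving, neutral)
latent_happy = recompose(driving, happy)
```

Swapping only the semantic layer changes the emotion part of the latent and leaves the pose and local expression parts untouched, which is the kind of independent control the framework aims for.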

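Key innovation: The key innovation of PortraitDirector is its hierarchical motion disentanglement and composition strategy. Whereas existing methods treat facial motion as a single signal, PortraitDirector splits it into a spatial layer and a semantic layer, enabling finer control. In addition, the Emotion-Filtering Module removes emotional information from local expressions, which keeps the spatial and semantic layers independent.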
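The abstract says the Emotion-Filtering Module leverages an information bottleneck but does not say in which form. One common realization is the variational form, which adds a KL penalty between a stochastic code and a standard normal prior to the task loss. The sketch below shows only that generic penalty; treating it as the module's actual objective is an assumption, and `beta` and its default value are hypothetical.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def bottleneck_loss(task_loss: float,
                    mu: Sequence[float],
                    log_var: Sequence[float],
                    beta: float = 0.01) -> float:
    """Task loss plus a beta-weighted capacity penalty (beta is a hypothetical value)."""
    return task_loss + beta * kl_to_standard_normal(mu, log_var)
```

A larger `beta` pushes the code toward the prior, so it can carry less information about its input. In this setting the intent would be that the local expression code keeps what is needed to reproduce physical motion and drops emotional cues.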

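Key design: The Emotion-Filtering Module applies the information bottleneck principle: by limiting how much information the local expression representation can carry, it removes emotional information. Diffusion distillation speeds up diffusion-model inference, causal attention maintains temporal consistency, and VAE acceleration improves the efficiency of the VAE. Specific hyperparameter settings and network architecture details are given in the paper.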
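Causal attention, in its generic form, restricts each position to attend only to itself and earlier positions, so a frame can be produced without waiting for future frames. The sketch below shows that masking rule over a matrix of attention scores. It is a textbook illustration in plain Python, not the paper's attention layer, and the function name is hypothetical.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax where query frame t may only attend to key frames <= t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                       # drop future frames
        peak = max(visible)                          # subtract max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights


# Uniform scores over 3 frames: each frame spreads attention over itself and the past.
w = causal_attention_weights([[0.0, 0.0, 0.0]] * 3)
```

With uniform scores, the first frame attends only to itself, the second splits its weight evenly over the first two frames, and no row places any weight on a later frame.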

📊 Experimental Highlights

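PortraitDirector reaches real-time performance of 20 FPS on a single 5090 GPU, with an end-to-end latency of 800 ms at 512 x 512 resolution. Experimental results indicate that it outperforms existing methods in both fidelity and controllability. Qualitative results show high-quality reenactment along with fine control over head pose, local expressions, and global emotion.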
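For a sense of scale, the two reported numbers can be related with simple arithmetic. The variable names below are illustrative, and how the 800 ms is divided among buffering, inference, and decoding is not stated here.

```python
FPS = 20           # reported frame rate
LATENCY_MS = 800   # reported end-to-end latency

frame_interval_ms = 1000 / FPS                       # time between output frames
frames_per_latency = LATENCY_MS / frame_interval_ms  # frame intervals spanned by the latency
```

At 20 FPS a new frame is due every 50 ms, so an 800 ms end-to-end latency spans 16 frame intervals. Throughput and latency are therefore separate properties: the output keeps pace with a 20 FPS stream while trailing the driving input by a little under a second.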

🎯 Application Scenarios

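PortraitDirector can be applied to virtual reality, augmented reality, games, video conferencing, virtual streamers, and similar areas. The technique enables highly realistic and controllable avatars and can improve the user experience. In video conferencing, for example, users could use PortraitDirector to adjust their facial expressions to convey emotion more effectively. In games, developers could use it to create more vivid and expressive characters.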

📄 Abstract (Original)

Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via an Emotion-Filtering Module leveraging an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512 x 512 face reenactment at 20 FPS with an end-to-end 800 ms latency on a single 5090 GPU.
