ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE

📄 arXiv: 2409.07966v4

Authors: Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak

Categories: cs.CV, cs.AI

Published: 2024-09-12 (updated: 2025-02-16)

Comments: 14 pages, 9 figures, 3 tables. Includes code. Accepted at ACM SIGGRAPH MIG 2024

DOI: 10.1145/3677388.3696320

🔗 Code/Project: GitHub (https://github.com/uuembodiedsocialai/ProbTalk3D/)


💡 One-Sentence Takeaway

Proposes ProbTalk3D to tackle emotion-controllable, speech-driven 3D facial animation synthesis.

🎯 Matched Area: Pillar 4: Generative Motion

Keywords: 3D facial animation, emotion control, non-deterministic generation, VQ-VAE, audio-driven, deep learning, computer vision

📋 Key Points

  1. Existing audio-driven 3D facial animation synthesis methods focus mainly on lip-sync and identity control, with little attention to emotional expression.
  2. This paper proposes ProbTalk3D, which uses a two-stage VQ-VAE model to achieve emotion-controllable, non-deterministic 3D facial animation synthesis.
  3. Experiments show that ProbTalk3D outperforms existing deterministic and non-deterministic models on emotion-controlled facial animation synthesis.

📝 Abstract (Summary)

Audio-driven 3D facial animation synthesis is an active research area. Despite encouraging results, existing methods concentrate on lip-sync and identity control, overlooking the importance of emotion and emotion control in the generative process. To address this, the paper proposes ProbTalk3D, a non-deterministic neural network approach built on a two-stage VQ-VAE model that enables emotion-controllable, speech-driven 3D facial animation synthesis. Through an extensive comparative analysis against recent 3D facial animation synthesis methods, the authors demonstrate the model's superior performance and release a public codebase for researchers.

🔬 Method Details

Problem definition: Existing audio-driven 3D facial animation synthesis methods are typically deterministic, i.e., the same audio input always produces the same output motion, lacking diversity and richness of emotional expression.

Core idea: ProbTalk3D introduces emotion control and a non-deterministic generation mechanism to produce diverse, emotionally rich facial animations. The method leverages the strengths of the VQ-VAE model and combines emotion labels with intensity levels to achieve precise control over emotion.
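
The conditioning described above can be sketched as follows. This is an illustrative toy example, not the authors' implementation: the category counts, embedding dimension, and the random (stand-in for learned) embedding tables are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EMOTIONS = 8      # assumed number of emotion labels in the dataset
NUM_INTENSITIES = 3   # assumed number of intensity levels
EMB_DIM = 16          # illustrative embedding size

# Stand-ins for learned embedding tables (random here for illustration).
emotion_table = rng.normal(size=(NUM_EMOTIONS, EMB_DIM))
intensity_table = rng.normal(size=(NUM_INTENSITIES, EMB_DIM))

def style_vector(emotion_id: int, intensity_id: int) -> np.ndarray:
    """Concatenate emotion and intensity embeddings into one condition vector
    that can be fed to the generator alongside audio features."""
    return np.concatenate([emotion_table[emotion_id],
                           intensity_table[intensity_id]])

cond = style_vector(emotion_id=2, intensity_id=1)
print(cond.shape)  # (32,)
```

In a trained model these tables would be learned jointly with the rest of the network; the point here is only that label and intensity enter as a joint conditioning signal.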

Technical framework: The method adopts a two-stage VQ-VAE model: audio features are first extracted, and the corresponding 3D facial animation is then generated. The design allows randomness to be injected into the generation process, yielding non-deterministic outputs.
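
The vector-quantization step at the heart of a VQ-VAE can be sketched in a few lines (a generic illustration, not the paper's code; codebook size and latent dimension are assumed): each encoder output frame is snapped to its nearest codebook entry, and the resulting discrete token sequence is what a second-stage model can predict from audio.

```python
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 256   # number of discrete codes (assumed)
LATENT_DIM = 64       # per-frame latent dimension (assumed)

codebook = rng.normal(size=(CODEBOOK_SIZE, LATENT_DIM))

def quantize(z: np.ndarray):
    """Map each latent frame z[t] to its nearest codebook vector (L2)."""
    # dists[t, k] = ||z[t] - codebook[k]||^2, via broadcasting
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)     # one discrete token per frame
    return codebook[idx], idx      # quantized latents + token indices

z = rng.normal(size=(10, LATENT_DIM))  # 10 latent frames from an encoder
z_q, tokens = quantize(z)
print(z_q.shape, tokens.shape)         # (10, 64) (10,)
```

Sampling over these discrete tokens (rather than regressing continuous values) is what lets the second stage produce different, equally plausible motions for the same audio.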

Key innovation: ProbTalk3D is the first non-deterministic 3D facial animation synthesis method to combine a rich emotional dataset with emotion control, markedly improving the emotional expressiveness of the generated animation.

Key design: The model uses multiple loss functions to optimize generation quality, including a reconstruction loss and an emotion-consistency loss, and adopts adaptive parameter settings in the network architecture to improve the diversity of outputs and the accuracy of emotional expression.
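
A minimal sketch of how such a composite objective could be assembled (illustrative only, not the paper's exact formulation; the weights and the emotion-feature comparison are assumptions):

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two arrays."""
    return float(((a - b) ** 2).mean())

def total_loss(pred_motion, gt_motion, pred_emo_feat, gt_emo_feat,
               w_rec=1.0, w_emo=0.5):  # weights are placeholders
    rec = mse(pred_motion, gt_motion)       # reconstruction term
    emo = mse(pred_emo_feat, gt_emo_feat)   # emotion-consistency term
    return w_rec * rec + w_emo * emo
```

The auxiliary term here compares emotion features of predicted versus target motion; in practice such features would come from a pretrained emotion encoder, and the weights would be tuned on validation data.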


📊 Experimental Highlights

Results show that ProbTalk3D performs strongly on emotion-controlled 3D facial animation synthesis: compared with recent deterministic and non-deterministic models, the generated animations show clear gains in emotional expression. Specific performance numbers are not reproduced here, but overall quality exceeds existing methods.

🎯 Application Scenarios

Potential application areas include virtual reality, game development, and film/TV production, where the technique can give character animation a more natural and emotionally rich quality. In the future it could also play an important role in social robotics and human-computer interaction, improving user experience.

📄 Abstract (Original)

Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. That is mainly due to the lack of emotionally rich facial animation data and algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of the models are deterministic, meaning given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and an emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against the recent 3D facial animation synthesis approaches, by evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method incorporating a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (https://github.com/uuembodiedsocialai/ProbTalk3D/).