DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2
Authors: Fan Zhang, Siyuan Zhao, Naye Ji, Zhaohan Wang, Jingmei Wu, Fuxing Gao, Zhenqing Ye, Leyao Yan, Lanxin Dai, Weidong Geng, Xin Lyu, Bozuo Zhao, Dingguo Yu, Hui Du, Bin Hu
Categories: cs.SD, cs.AI, cs.GR, cs.HC, cs.MM, eess.AS
Published: 2024-11-23
Comments: 13 pages, 11 figures
💡 One-Sentence Takeaway
Proposes DiM-Gestor to address the quadratic time and space complexity of existing speech-driven gesture generation models
🎯 Matched Area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)
Keywords: speech-driven gesture generation, transformer models, adaptive layer normalization, fuzzy feature extraction, virtual human interaction, diffusion models, Chinese co-speech gesture dataset
📋 Key Points
- Existing speech-driven gesture generation models face significant quadratic time and space complexity, which limits their scalability and efficiency.
- DiM-Gestor offers an innovative solution by introducing a fuzzy feature extractor and a speech-to-gesture mapping module, both built on the Mamba-2 architecture.
- Experiments show that DiM-Gestor reduces memory usage by roughly 2.4× and speeds up inference by 2–4×, while performing strongly on the newly released Chinese Co-Speech Gestures dataset.
🔬 Method Details
Problem definition: The paper targets the quadratic time and space complexity of existing speech-driven gesture generation models, which limits their scalability and efficiency in practical applications.
Core idea: DiM-Gestor adopts the Mamba-2 architecture and combines a fuzzy feature extractor with a speech-to-gesture mapping module to improve both the quality and the efficiency of gesture generation.
Technical framework: DiM-Gestor consists of two main modules: the fuzzy feature extractor extracts speech features, and the speech-to-gesture mapping module maps those features to gestures, with the whole process conditioned through an adaptive layer normalization mechanism.
Key innovation: An Adaptive Layer Normalization (AdaLN) mechanism lets the model apply condition-dependent transformations uniformly across all sequence tokens, enabling precise modeling of the interplay between speech features and gesture dynamics; this is the essential difference from existing methods.
Key design: The model integrates a Chinese pre-trained model so that the fuzzy feature extractor can autonomously extract implicit, continuous speech features; training and inference are performed with a diffusion model to ensure diverse, natural generated gestures.
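The AdaLN conditioning described above can be sketched as follows: a condition vector (e.g., fused speech features plus the diffusion timestep) is projected to a per-feature scale and shift, which are then applied uniformly to every token of the normalized sequence. This is a minimal NumPy illustration of the general AdaLN pattern, not the authors' implementation; all function names and shapes are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension, with no learned
    # affine parameters: in AdaLN the scale/shift come from the
    # condition instead of being fixed weights.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln(x, cond, w, b):
    """Adaptive LayerNorm (illustrative).

    x:    (seq_len, d)  token sequence, e.g. noisy gesture latents
    cond: (c,)          condition vector (speech latent + timestep)
    w, b: projection parameters, w: (c, 2*d), b: (2*d,)
    """
    # Project the condition to a scale (gamma) and shift (beta).
    gamma, beta = np.split(cond @ w + b, 2)
    # The same gamma/beta modulate every token in the sequence,
    # which is the "uniform transformation across all tokens"
    # property highlighted in the paper.
    return layer_norm(x) * (1.0 + gamma) + beta
```

With a zero condition and zero projection weights, the modulation reduces to a plain layer norm (gamma = beta = 0), which is the usual initialization choice so that conditioning is learned gradually.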
📊 Experimental Highlights
Experiments show that DiM-Gestor reduces memory usage by approximately 2.4× and improves inference speed by 2–4× compared with a Transformer-based architecture, while delivering competitive quality, validating the model's effectiveness.
🎯 Application Scenarios
Potential applications include virtual human interaction, animation production, and online education, where the model can provide more natural interaction and more efficient content generation. Looking ahead, DiM-Gestor could play a role in emerging technologies such as augmented and virtual reality, improving the user experience.
📄 Abstract (Original)
Speech-driven gesture generation using transformer-based generative models represents a rapidly advancing area within virtual human creation. However, existing models face significant challenges due to their quadratic time and space complexities, limiting scalability and efficiency. To address these limitations, we introduce DiM-Gestor, an innovative end-to-end generative model leveraging the Mamba-2 architecture. DiM-Gestor features a dual-component framework: (1) a fuzzy feature extractor and (2) a speech-to-gesture mapping module, both built on the Mamba-2. The fuzzy feature extractor, integrated with a Chinese Pre-trained Model and Mamba-2, autonomously extracts implicit, continuous speech features. These features are synthesized into a unified latent representation and then processed by the speech-to-gesture mapping module. This module employs an Adaptive Layer Normalization (AdaLN)-enhanced Mamba-2 mechanism to uniformly apply transformations across all sequence tokens. This enables precise modeling of the nuanced interplay between speech features and gesture dynamics. We utilize a diffusion model to train and infer diverse gesture outputs. Extensive subjective and objective evaluations conducted on the newly released Chinese Co-Speech Gestures dataset corroborate the efficacy of our proposed model. Compared with Transformer-based architecture, the assessments reveal that our approach delivers competitive results and significantly reduces memory usage, approximately 2.4 times, and enhances inference speeds by 2 to 4 times. Additionally, we released the CCG dataset, a Chinese Co-Speech Gestures dataset, comprising 15.97 hours (six styles across five scenarios) of 3D full-body skeleton gesture motion performed by professional Chinese TV broadcasters.