Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Authors: Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly

Categories: cs.MM, cs.CV, cs.SD, eess.AS

Published: 2024-11-26 (updated: 2025-05-29)

🔗 Code/Project: PROJECT_PAGE

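💡 One-Sentence Summary

Visatronic: a multimodal decoder-only model for speech synthesis that generates speech from video and text.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, speech synthesis, video-to-speech, Transformer model, decoder-only model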

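📋 Key Points

Existing methods rely on pre-trained LLMs and struggle to model temporal dependencies across modalities accurately, which limits effective fusion of multimodal information.
Visatronic uses a unified decoder-only Transformer architecture that embeds video, text, and speech into a shared space, and explores token mixing strategies.
Evaluated on VoxCeleb2 and LRS3, Visatronic outperforms prior SOTA methods in the zero-shot setting; the paper also proposes a new temporal synchronization metric, TimeSync.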

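📝 Abstract (Translated)

The rapid progress of large language models (LLMs) has significantly improved the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are mostly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across modalities and thus limit the model's ability to jointly process and leverage multimodal inputs. To specifically study the alignment of text, video, and speech modalities in LLM-style (decoder-only) models, the authors consider a simplified multimodal generation task, Video-Text to Speech (VTTS): generating speech conditioned on the corresponding text and on video of the speaker. The ultimate goal is speech that not only follows the text but is also temporally aligned with the video and consistent with the facial expressions. The paper first introduces Visatronic, a unified multimodal decoder-only Transformer model that adopts an LLM-style architecture to embed visual, textual, and speech inputs into a shared subspace, treating all modalities as temporally aligned token streams. It then carefully studies different token mixing strategies to find the best way to feed video and text conditioning into the audio generation steps. Visatronic is evaluated extensively on the challenging VoxCeleb2 dataset and shows zero-shot generalization to LRS3, where Visatronic trained on VoxCeleb2 achieves a 4.5% WER, outperforming prior SOTA methods trained only on LRS3, which report a 21.4% WER. In addition, the authors propose a new objective metric, TimeSync, designed specifically to measure phoneme-level temporal alignment between generated and reference speech, further ensuring synchronization quality.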

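🔬 Method Details

Problem definition: The paper addresses the Video-Text to Speech (VTTS) generation task: generating speech that corresponds to a given video and text. Existing methods rely mainly on pre-trained large language models (LLMs), but they struggle to model temporal dependencies between modalities accurately, so the generated speech can be out of sync with the video and inconsistent with the facial expressions.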

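Core idea: Visatronic represents all modalities (video, text, speech) uniformly as token sequences and uses a single decoder-only Transformer to learn the relationships among them. Embedding all modalities into a shared subspace lets the model better capture how they relate, and so generate more accurate and better-synchronized speech.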

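Technical framework: The overall architecture of Visatronic is a decoder-only Transformer. It contains the following main modules: 1) a video encoder that converts video frames into a sequence of visual tokens; 2) a text encoder that converts text into a sequence of text tokens; 3) a speech encoder that converts speech into a sequence of speech tokens; 4) a decoder that takes the video tokens, the text tokens, and the speech tokens generated so far as input and predicts the next speech token. The model is trained LLM-style, that is, by autoregressively predicting the next token.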
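The autoregressive step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the token-id offsets, the `EOS` id, and `toy_model` (a stand-in for the trained decoder) do not come from the paper. The sketch only shows the control flow, in which video and text tokens form a prefix and speech tokens are predicted one at a time, each conditioned on everything before it.

```python
from typing import Callable, List

# Hypothetical id ranges so that the three modalities share one vocabulary.
VIDEO_OFFSET, TEXT_OFFSET, SPEECH_OFFSET = 0, 1000, 2000
EOS = 2999  # hypothetical end-of-speech token id


def build_prefix(video_ids: List[int], text_ids: List[int]) -> List[int]:
    """Map per-modality ids into the shared vocabulary and concatenate them."""
    return [VIDEO_OFFSET + v for v in video_ids] + [TEXT_OFFSET + t for t in text_ids]


def generate_speech(prefix: List[int],
                    next_token: Callable[[List[int]], int],
                    max_len: int = 50) -> List[int]:
    """Greedy autoregressive loop: every step sees the conditioning prefix
    plus all speech tokens generated so far."""
    seq = list(prefix)
    speech: List[int] = []
    for _ in range(max_len):
        tok = next_token(seq)
        if tok == EOS:
            break
        speech.append(tok)
        seq.append(tok)
    return speech


def toy_model(seq: List[int]) -> int:
    """Stand-in for the decoder: emits one speech token per video token, then EOS."""
    n_video = sum(1 for t in seq if t < TEXT_OFFSET)
    n_speech = sum(1 for t in seq if t >= SPEECH_OFFSET)
    return SPEECH_OFFSET + n_speech if n_speech < n_video else EOS


prefix = build_prefix(video_ids=[5, 6, 7], text_ids=[1, 2])
speech_tokens = generate_speech(prefix, toy_model)
print(speech_tokens)
```

In a real system `next_token` would run a Transformer forward pass and sample from its output distribution; the loop structure stays the same.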

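Key innovation: Visatronic's key innovations are its unified multimodal representation and its token mixing strategies. Unlike previous methods, Visatronic does not depend on a pre-trained LLM; it learns the representations of all modalities from scratch. The paper also explores different token mixing strategies to determine the best way to pass video and text information to the speech generation steps.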
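To make the notion of a token mixing strategy concrete, the sketch below contrasts two possible orderings of the same tokens. Both orderings, the function names, and the `speech_per_frame` parameter are hypothetical illustrations; this summary does not say which strategies the paper actually compares.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (modality tag, token id); the tags are illustrative


def mix_sequential(text: List[int], video: List[int], speech: List[int]) -> List[Token]:
    """Ordering A: all conditioning first (text, then video), then all speech."""
    return ([("text", t) for t in text]
            + [("video", v) for v in video]
            + [("speech", s) for s in speech])


def mix_time_aligned(text: List[int], video: List[int], speech: List[int],
                     speech_per_frame: int) -> List[Token]:
    """Ordering B: text first, then each video frame followed by the speech
    tokens that fall inside that frame's time span."""
    out: List[Token] = [("text", t) for t in text]
    for i, v in enumerate(video):
        out.append(("video", v))
        chunk = speech[i * speech_per_frame:(i + 1) * speech_per_frame]
        out.extend(("speech", s) for s in chunk)
    return out


text, video, speech = [10, 11], [20, 21], [30, 31, 32, 33]
seq_a = mix_sequential(text, video, speech)
seq_b = mix_time_aligned(text, video, speech, speech_per_frame=2)
print(seq_a)
print(seq_b)
```

Both sequences contain the same tokens; only the order differs, and with it what a causal decoder has already seen when it predicts each speech token.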

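Key design: The key design elements of Visatronic are: 1) Transformer encoders that convert video frames, text, and speech into token sequences; 2) a cross-attention mechanism that fuses information from the different modalities; 3) an autoregressive decoder that generates the speech token sequence; 4) a new objective metric, TimeSync, for evaluating temporal synchronization between generated and reference speech. Specific hyperparameter settings and loss-function details are given in the paper.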
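This summary does not give the formula for TimeSync, so the sketch below is not that metric. It is a hypothetical stand-in that shows what a phoneme-level timing comparison could look like: it assumes both the generated and the reference speech have already been segmented into the same phoneme sequence with start and end times (for example by a forced aligner), and it reports the mean absolute boundary offset in seconds.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (phoneme, start_sec, end_sec)


def mean_boundary_offset(generated: List[Segment], reference: List[Segment]) -> float:
    """Mean absolute difference, in seconds, between the start and end times
    of position-matched phonemes. Illustrative only; not the paper's TimeSync."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    if len(generated) != len(reference):
        raise ValueError("phoneme sequences must have equal length")
    total = 0.0
    for (p_gen, s_gen, e_gen), (p_ref, s_ref, e_ref) in zip(generated, reference):
        if p_gen != p_ref:
            raise ValueError(f"phoneme mismatch: {p_gen} vs {p_ref}")
        total += abs(s_gen - s_ref) + abs(e_gen - e_ref)
    # Two boundaries (start and end) are compared per phoneme.
    return total / (2 * len(reference))


reference = [("HH", 0.0, 0.1), ("AY", 0.1, 0.4)]
generated = [("HH", 0.0, 0.2), ("AY", 0.2, 0.4)]
offset = mean_boundary_offset(generated, reference)
print(f"mean boundary offset: {offset:.3f} s")
```

A lower value means the generated phonemes start and end closer to where they do in the reference; a value of zero means identical timing.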

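📊 Experiment Highlights

Visatronic was trained on the VoxCeleb2 dataset and evaluated zero-shot on the LRS3 dataset. On LRS3 it achieves a 4.5% WER, substantially better than the prior SOTA methods trained only on LRS3 (21.4% WER). In addition, the proposed TimeSync metric is reported to evaluate the temporal synchronization of generated speech effectively.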
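For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a transcript of the generated speech and the reference text, divided by the number of reference words. The sketch below is a standard dynamic-programming implementation with a made-up sentence pair; it is not the paper's evaluation pipeline, which would also need a speech recognizer to produce the transcript. The last lines compute the relative WER reduction implied by the two reported numbers, which comes to roughly 79%.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one substitution (sat -> sit) and one deletion (the).
example = wer("the cat sat on the mat", "the cat sit on mat")
print(f"example WER: {example:.3f}")

# Relative reduction implied by the reported LRS3 numbers (21.4% -> 4.5%).
relative_reduction = (21.4 - 4.5) / 21.4
print(f"relative WER reduction: {relative_reduction:.1%}")
```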

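🎯 Application Scenarios

Visatronic could be applied in areas such as virtual assistants, video conferencing, and film production. For example, it can generate speech synchronized with video content to improve the user experience. It could also be used for speech restoration, recovering missing or corrupted speech segments based on the video. In the future, the technology may play an important role in human-computer interaction and content creation.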

📄 Abstract (Original)

The rapid progress of foundation models and large language models (LLMs) has fueled significant improvement in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's ability to jointly process and leverage multimodal inputs. To specifically investigate the alignment of text, video, and speech modalities in LLM-style (decoder-only) models, we consider a simplified multimodal generation task, Video-Text to Speech (VTTS): speech generation conditioned on both its corresponding text and video of talking people. The ultimate goal is to generate speech that not only follows the text but also aligns temporally with the video and is consistent with the facial expressions. In this paper, we first introduce Visatronic, a unified multimodal decoder-only transformer model that adopts an LLM-style architecture to embed visual, textual, and speech inputs into a shared subspace, treating all modalities as temporally aligned token streams. Next, we carefully explore different token mixing strategies to understand the best way to propagate information from the steps where video and text conditioning is input to the steps where the audio is generated. We extensively evaluate Visatronic on the challenging VoxCeleb2 dataset and demonstrate zero-shot generalization to LRS3, where Visatronic, trained on VoxCeleb2, achieves a 4.5% WER, outperforming prior SOTA methods trained only on LRS3, which report a 21.4% WER. Additionally, we propose a new objective metric, TimeSync, specifically designed to measure phoneme-level temporal alignment between generated and reference speech, further ensuring synchronization quality. Demo: https://apple.github.io/visatronic-demo/
