MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Authors: Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou

Categories: cs.CL, cs.AI, cs.HC, cs.SD, eess.AS

Published: 2025-01-10

Note: Work in progress. Authors are listed in alphabetical order by family name.

🔗 Code/Project: PROJECT_PAGE

💡 One-Sentence Summary

MinMo: a multimodal large language model for seamless voice interaction

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, large language models, voice interaction, speech generation, duplex dialogue

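📋 Key Points

Existing voice interaction models fall short in handling sequence-length mismatches, insufficient pre-training, or limited dataset scale.
MinMo achieves seamless voice interaction through multi-stage alignment training: speech-to-text, text-to-speech, speech-to-speech, and duplex interaction.
MinMo reaches state-of-the-art results in speech understanding and generation while retaining text LLM capabilities, and it supports full-duplex conversation with low latency.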

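📝 Abstract (Translated)

This paper introduces MinMo, a multimodal large language model with approximately 8 billion parameters, designed for seamless voice interaction. Existing voice interaction models fall into two categories: native and aligned. Native models integrate speech and text processing in a single framework but struggle with differing sequence lengths and insufficient pre-training. Aligned models preserve the capabilities of the text LLM but are usually limited by small datasets and a narrow focus on speech tasks. MinMo addresses the main limitations of prior aligned multimodal models through multi-stage training, comprising speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data covering a broad range of speech tasks. After this training, MinMo achieves state-of-the-art performance in speech understanding and generation while retaining text LLM capabilities, and it supports full-duplex conversation. The paper also proposes a novel and simple voice decoder that outperforms prior models in speech generation. MinMo's enhanced instruction following allows speech generation to be controlled by user instructions, including nuances such as emotion, dialect, and speaking rate, as well as mimicking specific voices. MinMo's speech-to-text latency is approximately 100 ms; its full-duplex latency is approximately 600 ms in theory and approximately 800 ms in practice. The code and models will be released soon.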

🔬 Method Details

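Problem definition: The paper targets the limitations of existing voice interaction models on complex speech tasks. Native models struggle with the length mismatch between speech and text sequences, while aligned models are constrained by dataset scale and task coverage. Existing methods have difficulty combining speech understanding, speech generation, and duplex interaction, and their latency is high.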
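
To make the sequence-length mismatch concrete, the sketch below compares how many tokens one utterance occupies as speech versus as text. The speech token rate (25 per second), the speaking rate (2.5 words per second), and the text tokenizer ratio (1.3 tokens per word) are illustrative assumptions, not values taken from the paper or from this summary.

```python
# All three constants are illustrative assumptions, not values from the MinMo paper.
SPEECH_TOKENS_PER_SECOND = 25.0   # hypothetical speech tokenizer frame rate
WORDS_PER_SECOND = 2.5            # hypothetical conversational speaking rate
TEXT_TOKENS_PER_WORD = 1.3        # hypothetical text tokenizer ratio


def speech_token_count(duration_s: float) -> int:
    """Number of speech tokens needed to cover `duration_s` seconds of audio."""
    return round(duration_s * SPEECH_TOKENS_PER_SECOND)


def text_token_count(duration_s: float) -> int:
    """Approximate number of text tokens for the words spoken in that time."""
    return round(duration_s * WORDS_PER_SECOND * TEXT_TOKENS_PER_WORD)


def length_ratio(duration_s: float) -> float:
    """How many times longer the speech sequence is than the text sequence."""
    return speech_token_count(duration_s) / text_token_count(duration_s)


if __name__ == "__main__":
    for seconds in (4.0, 20.0, 40.0):
        print(seconds, speech_token_count(seconds), text_token_count(seconds),
              round(length_ratio(seconds), 2))
```

Under these assumed rates the speech sequence is several times longer than the text sequence for the same content, which is the kind of gap a model that handles both modalities has to absorb.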

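Core idea: Align the speech and text modalities through multi-stage training so that the model understands and generates speech better and supports natural, fluent duplex interaction. The staged strategy is intended to overcome the shortcomings of single-modality or simple alignment approaches.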

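Technical framework: MinMo's training has four main stages: 1) speech-to-text alignment, so the model can accurately convert speech to text; 2) text-to-speech alignment, so the model can generate natural speech from text; 3) speech-to-speech alignment, to improve the model's adaptability across speaking styles and environments; 4) duplex interaction alignment, so the model can hold real-time, two-way spoken conversation. The overall architecture is built on a large language model and adapted for speech processing.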
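
The four stages can be pictured as an ordered curriculum in which each stage starts from the result of the previous one. The sketch below only illustrates that ordering. The class, the stage identifiers, and the `fake_trainer` stand-in are hypothetical names introduced here, and nothing in it reflects the authors' actual training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentStage:
    name: str   # hypothetical identifier, not from the paper
    goal: str   # paraphrased from the stage descriptions in this summary


# Stage order follows the summary; the identifiers are illustrative.
STAGES = (
    AlignmentStage("speech_to_text", "convert speech to text accurately"),
    AlignmentStage("text_to_speech", "generate natural speech from text"),
    AlignmentStage("speech_to_speech", "adapt to varied speaking styles and environments"),
    AlignmentStage("duplex_interaction", "hold real-time two-way spoken conversation"),
)


def run_curriculum(stages, train_one_stage):
    """Run stages strictly in order, passing each stage's result to the next."""
    state = None
    history = []
    for stage in stages:
        state = train_one_stage(stage, state)
        history.append(stage.name)
    return state, history


def fake_trainer(stage, previous_state):
    """Stand-in for real training: records which stages have been applied."""
    applied = [] if previous_state is None else list(previous_state)
    applied.append(stage.name)
    return applied


if __name__ == "__main__":
    final_state, order = run_curriculum(STAGES, fake_trainer)
    print(order)
```

Passing the trainer in as a callable keeps the ordering logic separate from whatever each stage actually does, so the same loop would work with a real training function.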

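Key innovations: The multi-stage alignment training strategy and a novel voice decoder. Multi-stage alignment fuses the speech and text modalities more effectively, improving the model's speech understanding and generation. The proposed voice decoder outperforms existing models in speech generation, producing more natural and expressive speech.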

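Key design: MinMo has approximately 8 billion parameters and is built on a large language model backbone. Training uses 1.4 million hours of diverse speech data. The voice decoder adopts a new structure, but its details are not given in this summary. The loss functions are said to be tailored to the multi-stage alignment, and their details are likewise not given here.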
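
As a back-of-the-envelope reading of these two figures, the sketch below converts the parameter count into weight memory and the training audio into years of continuous listening. The 2 bytes per parameter (16-bit weights) is an assumption, since the summary does not state a numeric precision, and "about 8B" is treated as exactly 8e9 for the arithmetic.

```python
PARAMETER_COUNT = 8e9          # "about 8B" from the summary, treated as exact here
BYTES_PER_PARAMETER = 2        # assumption: 16-bit weights; not stated in the source
TRAINING_AUDIO_HOURS = 1.4e6   # 1.4 million hours, from the summary


def weight_memory_gib(parameters: float, bytes_per_parameter: int) -> float:
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return parameters * bytes_per_parameter / 2**30


def audio_years(hours: float) -> float:
    """Length of the audio if played back to back, in 365-day years."""
    return hours / (24 * 365)


if __name__ == "__main__":
    print(round(weight_memory_gib(PARAMETER_COUNT, BYTES_PER_PARAMETER), 1), "GiB of weights")
    print(round(audio_years(TRAINING_AUDIO_HOURS), 1), "years of audio")
```

The memory figure covers weights only. Activations, caches, and optimizer state would add to it and depend on details the summary does not provide.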

📊 Experimental Highlights

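MinMo achieves state-of-the-art performance in speech understanding and generation, and keeps full-duplex latency low (about 600 ms in theory, about 800 ms in practice). The model can control speech generation according to user instructions, including nuances such as emotion, dialect, and speaking rate, and can mimic specific voices. Its speech-to-text latency is approximately 100 ms, which indicates strong real-time responsiveness.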
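
The reported latencies can be compared with a few lines of arithmetic. The sketch below computes the gap between the theoretical and practical full-duplex figures and checks the practical figure against a response-time budget. The three latency values come from the summary. The 1000 ms budget and the function names are hypothetical and serve only as an example.

```python
SPEECH_TO_TEXT_MS = 100        # approximate, from the summary
DUPLEX_THEORETICAL_MS = 600    # approximate, from the summary
DUPLEX_PRACTICAL_MS = 800      # approximate, from the summary


def overhead_ms(theoretical_ms: float, practical_ms: float) -> float:
    """Extra latency observed in practice beyond the theoretical figure."""
    return practical_ms - theoretical_ms


def overhead_fraction(theoretical_ms: float, practical_ms: float) -> float:
    """The same gap expressed relative to the theoretical figure."""
    return (practical_ms - theoretical_ms) / theoretical_ms


def within_budget(latency_ms: float, budget_ms: float) -> bool:
    """True if the latency fits inside a caller-chosen budget."""
    return latency_ms <= budget_ms


if __name__ == "__main__":
    hypothetical_budget_ms = 1000  # example budget, not from the paper
    print(overhead_ms(DUPLEX_THEORETICAL_MS, DUPLEX_PRACTICAL_MS))
    print(round(overhead_fraction(DUPLEX_THEORETICAL_MS, DUPLEX_PRACTICAL_MS), 3))
    print(within_budget(DUPLEX_PRACTICAL_MS, hypothetical_budget_ms))
```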

🎯 Application Scenarios

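MinMo has broad application potential, including intelligent assistants, voice search, real-time translation, voice-driven games, and accessible communication. The model enables more natural and fluent voice interaction, improves user experience, and lays groundwork for further progress in speech technology. Looking ahead, MinMo could be applied wherever voice interaction is needed, such as smart homes, in-vehicle systems, and education.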

📄 Abstract (Original)

Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo supports controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, and mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
