FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

📄 arXiv: 2407.04051v3

Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, Zhangyu Xiao, Zhijie Yan, Yexin Yang, Bin Zhang, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Siqi Zheng

Categories: cs.SD, cs.AI, eess.AS

Published: 2024-07-04 (updated: 2024-07-11)

Comments: Work in progress. Authors are listed in alphabetical order by family name.


💡 One-Sentence Takeaway

FunAudioLLM: voice understanding and generation foundation models for natural interaction between humans and LLMs.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: speech understanding, speech generation, large language models, multilingual, emotion recognition

📋 Key Points

  1. Existing voice interaction systems fall short in multilingual support, emotional expression, and personalized voice generation, which limits how natural and rich human-machine interaction can be.
  2. FunAudioLLM tackles speech understanding and speech generation with two dedicated models, SenseVoice and CosyVoice, enabling more natural and more controllable voice interaction.
  3. SenseVoice excels at multilingual speech recognition, while CosyVoice demonstrates strong capabilities in multilingual voice generation, voice cloning, and instruction following.

🔬 Method Details

Problem definition: The paper targets natural voice interaction between humans and large language models (LLMs). Existing approaches are limited in multilingual support, emotion recognition, and control over voice generation (timbre, style, speaker identity), and therefore cannot deliver a fluent, natural, and personalized voice interaction experience.

Core idea: Build two dedicated models: SenseVoice for speech understanding and CosyVoice for speech generation. SenseVoice focuses on multilingual speech recognition, emotion recognition, and audio event detection; CosyVoice focuses on multilingual voice generation with fine-grained control over timbre, speaking style, and speaker identity. Coupling these two models with an LLM yields richer, more natural voice interaction.

Technical framework: FunAudioLLM comprises two main modules, SenseVoice and CosyVoice. SenseVoice converts speech input into text plus emotion information; CosyVoice synthesizes natural speech from the LLM's output. SenseVoice ships in Small and Large versions, targeting low-latency and high-precision ASR respectively, while CosyVoice handles multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following.
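The cascaded understand → reason → speak loop described above can be sketched as follows. All function and class names here are hypothetical stand-ins, not the actual FunAudioLLM API; real deployments would load the open-source SenseVoice and CosyVoice checkpoints from https://github.com/FunAudioLLM.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the two FunAudioLLM components.
# Names and signatures are illustrative only.

@dataclass
class SenseVoiceResult:
    text: str                       # recognized transcript (multilingual ASR)
    emotion: str                    # speech emotion label, e.g. "happy"
    events: list = field(default_factory=list)  # audio events, e.g. ["laughter"]

def sense_voice(audio: bytes) -> SenseVoiceResult:
    """Stub for speech understanding: ASR + emotion + audio event detection."""
    return SenseVoiceResult(text="hello there", emotion="neutral")

def llm_respond(transcript: str, emotion: str) -> str:
    """Stub for the LLM: condition the reply on transcript and detected emotion."""
    return f"[reply to '{transcript}' spoken in a {emotion} tone]"

def cosy_voice(text: str, speaker: str = "default", style: str = "neutral") -> bytes:
    """Stub for speech generation with timbre/style/speaker control."""
    return f"<audio:{speaker}/{style}:{text}>".encode()

def voice_chat_turn(audio_in: bytes) -> bytes:
    """One turn of the cascaded pipeline: understand -> reason -> speak."""
    understood = sense_voice(audio_in)
    reply_text = llm_respond(understood.text, understood.emotion)
    return cosy_voice(reply_text, style=understood.emotion)
```

Passing the detected emotion into both the LLM prompt and the synthesis style is what lets the reply match the user's affect, which is the main point of pairing understanding and generation models around an LLM.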

Key innovations: 1) two dedicated models, SenseVoice and CosyVoice, handling speech understanding and speech generation respectively; 2) CosyVoice's strong multilingual generation, voice cloning, and instruction-following capabilities, which produce speech with rich emotion and personalized characteristics; 3) integration of SenseVoice and CosyVoice with an LLM to deliver a more natural, richer voice interaction experience.

Key design choices: SenseVoice-Small is optimized for low-latency scenarios, while SenseVoice-Large targets high-precision recognition across more than 50 languages. Technical details of CosyVoice, such as its network architecture, loss functions, and training strategy, are not described in detail in the paper and remain unknown here.
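The Small/Large split above suggests a simple deployment-time selection rule, sketched below. The five-language list for SenseVoice-Small (zh, en, yue, ja, ko) comes from the released model card rather than this summary, and the latency threshold is an arbitrary illustrative value, not a figure from the paper.

```python
# Hypothetical helper for choosing between the two SenseVoice variants.
# SMALL_LANGS is taken from the released SenseVoice-Small model card
# (an assumption beyond this summary); the 500 ms budget is illustrative.
SMALL_LANGS = {"zh", "en", "yue", "ja", "ko"}

def pick_sensevoice_variant(lang: str, max_latency_ms: float) -> str:
    """Prefer the low-latency Small model when the language allows it;
    otherwise fall back to the high-precision 50+-language Large model."""
    if lang in SMALL_LANGS and max_latency_ms < 500:
        return "SenseVoice-Small"
    return "SenseVoice-Large"
```

For example, a tight-latency Mandarin voice assistant would route to SenseVoice-Small, while batch transcription of French audio would route to SenseVoice-Large.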

📊 Experimental Highlights

SenseVoice-Small delivers extremely low-latency ASR in 5 languages, while SenseVoice-Large supports high-precision ASR for more than 50 languages. CosyVoice performs strongly on multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction following, producing speech with rich emotion and personalized characteristics.

🎯 Applications

FunAudioLLM has broad application prospects, including speech-to-speech translation, emotional voice chat, interactive podcasts, and personalized audiobook narration. The work can make human-machine interaction more natural and efficient, give users richer and more personalized voice experiences, and advance voice interaction technology across domains.
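Of these applications, speech-to-speech translation composes the three components most directly: recognize with SenseVoice, translate with an LLM, then re-synthesize in the target language while cloning the source speaker's timbre via CosyVoice's cross-lingual voice cloning. The sketch below uses hypothetical stub functions (none of these names are the actual FunAudioLLM API).

```python
# Illustrative speech-to-speech translation pipeline with stub components.
# Function names are hypothetical stand-ins for SenseVoice, an LLM, and
# CosyVoice; the stub inputs/outputs are hard-coded for demonstration.

def asr(audio: bytes) -> str:
    """Stub for SenseVoice ASR on the source speech."""
    return "你好,世界"

def llm_translate(text: str, target_lang: str) -> str:
    """Stub for LLM-based translation of the transcript."""
    return {"en": "Hello, world"}.get(target_lang, text)

def tts_cloned(text: str, reference_audio: bytes) -> bytes:
    """Stub for CosyVoice cross-lingual cloning: the reference audio
    supplies the speaker's timbre, while the text sets the content."""
    return f"<cloned:{text}>".encode()

def speech_to_speech_translate(audio: bytes, target_lang: str) -> bytes:
    """Recognize -> translate -> re-synthesize in the source speaker's voice."""
    source_text = asr(audio)
    translated = llm_translate(source_text, target_lang)
    return tts_cloned(translated, reference_audio=audio)
```

The key design point is reusing the input audio itself as the cloning reference, so the translated output keeps the original speaker's voice without any enrollment step, matching the zero-shot cloning capability the summary attributes to CosyVoice.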

📄 Abstract (original)

This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.