InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training

Authors: Dingdong Wang, Jin Xu, Ruihang Chu, Zhifang Guo, Xiong Wang, Jincenzi Wu, Dongchao Yang, Shengpeng Ji, Junyang Lin

Categories: cs.SD, cs.CL, cs.HC, eess.AS

Published: 2025-03-04 (updated: 2025-06-04)

Comments: Accepted to ACL 2025; Data is available at: https://huggingface.co/datasets/ddwang2000/SpeechInstructBench

💡 One-Sentence Summary

Proposes InSerter, a speech instruction-following method based on unsupervised interleaved pre-training.

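🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: speech large language models, speech instruction following, unsupervised pre-training, speech-text interleaving, text-to-speech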

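📋 Key Points

Existing speech large language models follow spoken instructions poorly: model intelligence drops markedly with speech input compared with text input.
InSerter uses unsupervised interleaved pre-training so that the model learns to map speech segments to text continuations, with no hand-designed data pairs.
InSerter achieves SOTA on SpeechInstructBench and shows superior or competitive results across multiple speech processing tasks.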

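📝 Abstract (translated)

This paper proposes InSerter, a simple and scalable training method for improving how well speech large language models (SpeechLLMs) follow speech instructions. InSerter pre-trains on large-scale unsupervised speech-text sequences, in which the speech is synthesized by text-to-speech conversion from randomly selected segments of a large text corpus. The model thereby learns to generate text continuations that correspond to the given speech segments, which removes the need for carefully designed data pairs. To evaluate speech instruction following systematically, the authors build SpeechInstructBench, the first comprehensive benchmark designed specifically for speech instruction-following tasks. Experiments show that InSerter achieves SOTA performance on SpeechInstructBench and superior or competitive results across a variety of speech processing tasks.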

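🔬 Method Details

Problem definition: Existing speech large language models perform poorly on speech instruction following. In particular, their intelligence is noticeably lower when processing speech input than when processing text input. Existing remedies, such as representation and behavior alignment, require carefully designed data pairs, which adds training complexity and yields limited gains.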

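Core idea: InSerter pre-trains on large-scale unsupervised speech-text sequences so that the model learns the mapping from speech segments to text continuations. Because text is converted to speech and interleaved with the original text, the model can learn the semantic association between speech and text, which improves the accuracy of speech instruction following.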

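Technical framework: The overall InSerter pipeline consists of the following steps: 1) randomly select text segments from a large-scale text corpus; 2) synthesize these segments into speech with text-to-speech (TTS) conversion; 3) interleave the synthesized speech segments with the original text segments to form speech-text sequences; 4) pre-train the speech large language model on these sequences so that it can generate the text continuation corresponding to a given speech segment.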
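The four steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Segment` class, the fixed word-count chunking, the `speech_prob` parameter, and `placeholder_tts` (which returns silence instead of calling a real TTS model) are all assumptions introduced here for demonstration.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Segment:
    modality: str                             # "speech" or "text"
    text: str                                 # source text of this segment
    audio: Optional[Sequence[float]] = None   # waveform, set only for speech


def split_into_segments(document: str, words_per_segment: int) -> List[str]:
    """Cut a document into consecutive chunks of at most `words_per_segment` words."""
    words = document.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def build_interleaved_sequence(
    document: str,
    tts: Callable[[str], Sequence[float]],
    speech_prob: float = 0.5,
    words_per_segment: int = 8,
    rng: Optional[random.Random] = None,
) -> List[Segment]:
    """Randomly replace some text chunks with synthesized speech, keeping order."""
    rng = rng or random.Random()
    sequence: List[Segment] = []
    for chunk in split_into_segments(document, words_per_segment):
        if rng.random() < speech_prob:
            # Hypothetical TTS call: any text -> waveform function fits here.
            sequence.append(Segment("speech", chunk, tts(chunk)))
        else:
            sequence.append(Segment("text", chunk))
    return sequence


def placeholder_tts(text: str) -> List[float]:
    """Stand-in for a real TTS model: returns silence, one sample per character."""
    return [0.0] * len(text)
```

Calling `build_interleaved_sequence(doc, placeholder_tts, rng=random.Random(0))` yields an ordered list of segments whose texts, joined by spaces, reproduce the whitespace-normalized document. A pre-training pipeline would then encode the speech segments with an audio encoder and the text segments with a tokenizer before feeding the sequence to the model.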

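Key innovation: InSerter's key innovation is its unsupervised interleaved pre-training. Unlike existing methods that need hand-designed data pairs, InSerter uses large-scale unsupervised data and generates speech-text sequences automatically with TTS, which lowers training cost and improves the model's generalization.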

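Key design: InSerter's key design choices include: 1) using a high-quality TTS model to generate realistic speech segments; 2) setting the length ratio between speech segments and text segments appropriately so that training is effective; 3) training the model to generate text continuations with a standard language-model objective, such as masked language modeling or causal language modeling.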
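As one illustration of a causal language modeling objective on interleaved sequences, the sketch below averages next-token negative log-likelihood over text targets only. Whether InSerter masks out speech positions in this way is an assumption made here for illustration; this summary does not state it, and the function name and arguments are hypothetical.

```python
import math
from typing import Sequence


def masked_next_token_loss(
    token_log_probs: Sequence[float],
    target_is_text: Sequence[bool],
) -> float:
    """Mean negative log-likelihood over positions whose target is a text token.

    token_log_probs[t] is the model's log-probability of the correct next
    token at position t; target_is_text[t] marks whether that target is text.
    Speech positions are skipped (an assumption of this sketch).
    """
    if len(token_log_probs) != len(target_is_text):
        raise ValueError("inputs must have the same length")
    selected = [-lp for lp, keep in zip(token_log_probs, target_is_text) if keep]
    if not selected:
        return 0.0  # no text targets in this sequence
    return sum(selected) / len(selected)
```

For example, with log-probabilities `[log 0.5, log 0.25, log 0.5]` and mask `[False, True, True]`, only the last two positions contribute, giving a loss of `1.5 * log 2`. In a tensor framework the same effect is usually obtained by assigning an ignore label to the masked positions.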

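📊 Experimental Highlights

InSerter achieves SOTA performance on SpeechInstructBench, which supports its effectiveness for speech instruction following. It also shows superior or competitive results on other speech processing tasks, which suggests good generalization. Specific performance figures are not given here, but the paper emphasizes its advantage over existing methods.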

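🎯 Application Scenarios

InSerter has broad potential applications, such as intelligent assistants, voice search, and speech translation. It can make voice interaction more natural and accurate and improve the user experience. The method could also be applied to other speech-related tasks, such as speech recognition and speech synthesis.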

📄 Abstract (original)

Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models significantly diminishes when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency between speech and text representations through techniques such as representation and behavior alignment, which involve the meticulous design of data pairs during the post-training phase. In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Consequently, the model acquires the ability to generate textual continuations corresponding to the provided speech segments, obviating the need for intensive data design endeavors. To systematically evaluate speech instruction-following capabilities, we introduce SpeechInstructBench, the first comprehensive benchmark specifically designed for speech-oriented instruction-following tasks. Our proposed InSerter achieves SOTA performance in SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.
