Towards unified brain-to-text decoding across speech production and perception

📄 arXiv: 2603.12628v1

Authors: Zhizhang Yuan, Yang Yang, Gaorui Zhang, Baowen Cheng, Zehan Wu, Yuhao Xu, Xiaoying Liu, Liang Chen, Ying Mao, Meng Li

Categories: q-bio.NC, cs.AI, eess.SP

Published: 2026-03-13

Comments: 37 pages, 9 figures


💡 One-sentence takeaway

Proposes a unified brain-to-text decoding framework that handles both speech production and perception, addressing the lack of multimodal language decoding.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: brain-computer interface, language models, multimodal decoding, Mandarin, neural signals

📋 Key points

  1. Existing brain-to-text decoding methods mostly target a single modality and lack a unified treatment of multimodal language.
  2. The proposed framework unifies decoding of Mandarin speech production and perception, strengthening the model's generalization ability.
  3. Experiments show the framework's decoding performance exceeds that of many large commercial language models, demonstrating its effectiveness.

📝 Abstract (translated)

Human daily communication relies mainly on speech production and perception. Prior brain-to-text decoding studies have largely focused on a single modality and on alphabetic languages. This paper presents a unified brain-to-sentence decoding framework for both speech production and perception in Mandarin Chinese. The framework shows strong generalization: trained only on single-character data, it achieves sentence-level decoding and supports characters and syllables unseen during training. It also enables direct and controlled comparison of neural dynamics across modalities. Mandarin is decoded by classifying the syllable components in the neural signals and then applying a post-trained large language model. The results lay a foundation for the development of multimodal language decoding systems.

🔬 Method details

Problem definition: this paper targets the limitations of existing brain-to-text decoding methods in multimodal language settings, in particular their inability to decode Mandarin speech production and perception within a single unified framework.

Core idea: a unified brain-to-sentence decoding framework that classifies syllable components in the neural signals and then applies a post-trained large language model to decode Mandarin efficiently.

Technical pipeline: the architecture has three main stages: first, syllable components (Hanyu Pinyin initials and finals) are classified from neural signals; next, a post-trained 7-billion-parameter large language model maps the resulting toneless Pinyin syllable sequence to a Chinese sentence; finally, two-stage inference refines the decoded output.
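The first stage's outputs feed the second as toneless Pinyin syllables. A minimal sketch of that hand-off, where the per-character classifier outputs and the simple concatenation rule are toy assumptions for illustration rather than the paper's actual models:

```python
# Toy sketch: per-character (initial, final) classifications from neural
# signals are combined into toneless Pinyin syllables, which a language
# model would then map to a Chinese sentence. The predicted pairs below
# are hypothetical classifier outputs, not data from the paper.

def combine_syllable(initial: str, final: str) -> str:
    """Join a classified initial and final into one toneless Pinyin syllable.
    An empty initial models zero-initial syllables such as 'an' or 'ai'."""
    return initial + final

def decode_syllables(component_pairs):
    """Map a sequence of (initial, final) classifications to syllables."""
    return [combine_syllable(i, f) for i, f in component_pairs]

# Hypothetical classifier output for "ni hao shi jie" (你好世界):
predicted = [("n", "i"), ("h", "ao"), ("sh", "i"), ("j", "ie")]
syllables = decode_syllables(predicted)
print(" ".join(syllables))  # -> ni hao shi jie
```

Decoding components rather than whole characters is what lets a classifier trained on single characters recombine into syllables and sentences it never saw during training.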

Key innovation: strong generalization: trained only on single-character data, the framework achieves sentence-level decoding and handles unseen characters and syllables.

Key design: a three-stage post-training and two-stage inference framework built on the 7B model, with decoding performance that exceeds commercial large language models with hundreds of billions of parameters or more. Specific loss functions and network architecture details are not given in the abstract; see the full paper.
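The abstract does not specify what the two inference stages do. As one hedged illustration only, a generate-then-verify loop, in which hypothetical LLM candidates are checked against the input Pinyin, could look like:

```python
# Hypothetical generate-then-verify sketch of two-stage inference.
# Stage 1 stands in for LLM sampling of candidate Chinese sentences;
# stage 2 keeps the first candidate whose (toy) Pinyin transliteration
# matches the input toneless syllables. The character-to-Pinyin table
# and the candidate list are illustrative assumptions, not paper data.

PINYIN = {"你": "ni", "河": "he", "好": "hao", "世": "shi", "界": "jie"}

def to_pinyin(sentence: str) -> list:
    """Toy transliteration using the lookup table above."""
    return [PINYIN[ch] for ch in sentence]

def stage1_candidates(syllables):
    """Stub for LLM sampling: propose candidate sentences, best-first."""
    return ["你河世界", "你好世界"]  # first candidate is deliberately wrong

def stage2_select(syllables, candidates):
    """Return the first candidate whose Pinyin matches the input."""
    for cand in candidates:
        if to_pinyin(cand) == syllables:
            return cand
    return candidates[0]  # fall back to the top-ranked candidate

syllables = ["ni", "hao", "shi", "jie"]
print(stage2_select(syllables, stage1_candidates(syllables)))  # -> 你好世界
```

A verification pass like this would let a smaller post-trained model reject hallucinated sentences that a single-pass decode might accept, which is one way a 7B model could compete with far larger ones on this task.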

🖼️ Key figures

fig_0
fig_1
fig_2

📊 Experiment highlights

Experiments show the framework's decoding performance exceeds that of many large commercial language models, with particularly strong results on characters and syllables unseen during training. Specific performance numbers are not given in the abstract; see the full paper.

🎯 Application scenarios

Potential applications include brain-computer interfaces, rehabilitation for speech and language disorders, and human-computer interaction. By decoding language across modalities, the approach could offer users of different language backgrounds a more natural way to communicate, advancing intelligent assistants and assistive technology.

📄 Abstract (original)

Speech production and perception are the main ways humans communicate daily. Prior brain-to-text decoding studies have largely focused on a single modality and alphabetic languages. Here, we present a unified brain-to-sentence decoding framework for both speech production and perception in Mandarin Chinese. The framework exhibits strong generalization ability, enabling sentence-level decoding when trained only on single-character data and supporting characters and syllables unseen during training. In addition, it allows direct and controlled comparison of neural dynamics across modalities. Mandarin speech is decoded by first classifying syllable components in Hanyu Pinyin, namely initials and finals, from neural signals, followed by a post-trained large language model (LLM) that maps sequences of toneless Pinyin syllables to Chinese sentences. To enhance LLM decoding, we designed a three-stage post-training and two-stage inference framework based on a 7-billion-parameter LLM, achieving overall performance that exceeds larger commercial LLMs with hundreds of billions of parameters or more. In addition, several characteristics were observed in Mandarin speech production and perception: speech production involved neural responses across broader cortical regions than auditory perception; channels responsive to both modalities exhibited similar activity patterns, with speech perception showing a temporal delay relative to production; and decoding performance was broadly comparable across hemispheres. Our work not only establishes the feasibility of a unified decoding framework but also provides insights into the neural characteristics of Mandarin speech production and perception. These advances contribute to brain-to-text decoding in logosyllabic languages and pave the way toward neural language decoding systems supporting multiple modalities.