Pre-Trained Foundation Model representations to uncover Breathing patterns in Speech

Authors: Vikramjit Mitra, Anirban Chatterjee, Ke Zhai, Helen Weng, Ayuko Hill, Nicole Hay, Christopher Webb, Jamie Cheng, Erdrin Azemi

Categories: cs.SD, cs.CL, cs.LG, eess.AS

Published: 2024-07-17

Comments: 8 pages, 6 figures, BioKDD workshop paper

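💡 One-sentence takeaway

Uses pre-trained foundation-model representations to identify breathing patterns in speech and estimate respiratory rate.

🎯 Matched areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

Keywords: respiratory rate estimation, speech analysis, pre-trained models, Wav2Vec2, Conv-LSTM, time-series modeling, health monitoring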

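📋 Key points

Respiratory rate is an important indicator of an individual's health, but conventional measurement relies on specialized equipment or professional training, which raises the cost.
The study estimates respiratory rate from speech without specialized equipment: a pre-trained model extracts speech features, and a Conv-LSTM network models the time series.
Experiments show that pre-trained model representations can estimate the respiration time series and yield respiratory rate with a low mean absolute error, which suggests practical potential.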

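📝 Abstract (translated)

This paper proposes a machine-learning approach to estimating respiratory rate (RR) from speech segments. Speech is recorded from subjects with a close-talking microphone, and the ground-truth RR comes from commercial-grade chest belts, manually corrected for errors. A convolutional long short-term memory network (Conv-LSTM) is proposed to estimate the respiration time series from the speech signal. The experiments show that representations from a pre-trained model such as Wav2Vec2 estimate the respiration time series with lower root-mean-squared error and higher correlation coefficient than the baseline. From the model-driven time series, RR can be estimated with a low mean absolute error (MAE) of about 1.6 breaths/min.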

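🔬 Method details

Problem definition: The paper addresses non-invasive, convenient estimation of respiratory rate (RR). Existing methods depend on specialized equipment or bio-sensors, which is costly and inconvenient. The respiratory information carried by speech is underused, and effective methods for estimating RR from speech are lacking.

Core idea: Use a pre-trained speech representation model (such as Wav2Vec2) to extract breathing-related features from speech, then use a temporal model (Conv-LSTM) to learn the mapping from those features to the respiration time series. This avoids the complexity of hand-designed features and takes advantage of pre-training on large amounts of unlabeled speech.

Technical framework: The overall pipeline has four stages: data collection, preprocessing, feature extraction, and respiratory rate estimation. First, speech is recorded with a close-talking microphone, and chest-belt respiration data serves as the ground truth. Next, a pre-trained Wav2Vec2 model extracts speech features. A Conv-LSTM network then models these features over time and predicts the respiration time series. Finally, the respiratory rate is estimated from the predicted respiration time series.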
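To train a sequence model on such features, the chest-belt signal has to be aligned with the feature frames. The sketch below counts how many frames a Wav2Vec2-style convolutional encoder emits for a given number of audio samples and picks one respiration target per frame. It assumes the kernel and stride layout of the public Wav2Vec2 "base" configuration; the source does not say which variant was used, and `num_frames` and `resample_to_frames` are hypothetical helper names.

```python
# (kernel width, stride) per convolutional layer, as in the public
# Wav2Vec2 "base" configuration. Assumption: the source does not state
# which Wav2Vec2 variant the authors used.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples):
    """Number of feature frames produced for `num_samples` audio samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def resample_to_frames(respiration, n_frames):
    """Pick one respiration value per feature frame (nearest earlier index)."""
    n = len(respiration)
    return [respiration[min(n - 1, (i * n) // n_frames)] for i in range(n_frames)]


# One second of 16 kHz audio and a 100-sample respiration trace.
frames = num_frames(16000)
targets = resample_to_frames(list(range(100)), frames)
```

With this layout the total stride is 320 samples, so 16 kHz audio yields roughly 50 frames per second, and the respiration targets are thinned to the same rate.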

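Key innovation: The main contribution is applying a pre-trained speech representation model to respiratory rate estimation. The general-purpose speech features learned during pre-training help extract breathing-related information from speech, which improves the accuracy of the estimate. In addition, the Conv-LSTM network captures the temporal dependencies of breathing patterns.

Key design: The paper uses Wav2Vec2 as the pre-trained model to extract deep speech representations. The Conv-LSTM network combines convolutional and LSTM layers to learn the mapping from speech features to the respiration time series. The mismatch between the predicted respiration time series and the ground truth is measured with root-mean-squared error (RMSE) and the correlation coefficient, which the abstract reports as evaluation metrics. The respiratory rate is obtained by analyzing the predicted respiration time series.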
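The source does not say how the respiratory rate is read off the predicted time series. One simple possibility, shown here only as an illustrative sketch, is to count inhalation peaks and divide by the segment duration. The function `estimate_rr`, the `min_gap_s` parameter, and the 25 Hz sampling rate are assumptions introduced for this example, not details from the paper.

```python
import math


def estimate_rr(signal, sample_rate_hz, min_gap_s=1.5):
    """Estimate breaths/min by counting local maxima above the mean.

    Hypothetical helper: the source does not describe how RR is derived
    from the predicted respiration time series.
    """
    mean = sum(signal) / len(signal)
    min_gap = int(min_gap_s * sample_rate_hz)  # minimum samples between peaks
    peaks = []
    for i in range(1, len(signal) - 1):
        is_peak = signal[i - 1] < signal[i] >= signal[i + 1]
        if is_peak and signal[i] > mean:
            # Ignore peaks that follow the previous one too closely.
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    duration_min = len(signal) / sample_rate_hz / 60.0
    return len(peaks) / duration_min


# Synthetic 60 s trace sampled at 25 Hz, breathing at 0.25 Hz (15 breaths/min).
fs = 25
trace = [math.sin(2 * math.pi * 0.25 * n / fs) for n in range(60 * fs)]
rr = estimate_rr(trace, fs)
```

On real predictions the trace would be noisier, so smoothing or a spectral estimate may be more robust than plain peak counting.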

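📊 Experimental highlights

The experiments show that speech features from a pre-trained Wav2Vec2 model, combined with a Conv-LSTM network, can estimate the respiration time series and yield the respiratory rate with a low mean absolute error (MAE) of about 1.6 breaths/min. Compared with the baseline, the approach achieves lower RMSE and a higher correlation coefficient, which supports the use of pre-trained models for respiratory rate estimation.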
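For reference, the three reported metrics can be computed as in the minimal sketch below: RMSE and the Pearson correlation coefficient on the respiration time series, and MAE on the per-segment respiratory rates. The function names and the toy numbers are illustrative assumptions and do not come from the paper.

```python
import math


def rmse(pred, ref):
    """Root-mean-squared error between two equal-length sequences."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def pearson(pred, ref):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(ref)
    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


def mae(pred, ref):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


# Toy respiration traces (made-up values): the prediction has the right
# shape but half the amplitude.
ref_trace = [0.0, 1.0, 0.0, -1.0]
pred_trace = [0.0, 0.5, 0.0, -0.5]
trace_rmse = rmse(pred_trace, ref_trace)
trace_corr = pearson(pred_trace, ref_trace)

# Toy per-segment respiratory rates in breaths/min (made-up values).
rr_mae = mae([14.0, 17.0, 12.0], [15.0, 16.0, 13.0])
```

The toy traces illustrate why both time-series metrics are useful: a prediction with the correct shape but the wrong amplitude has a perfect correlation and a nonzero RMSE.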

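🎯 Application scenarios

The results could be applied to remote health monitoring, fitness and exercise management, and psychological stress assessment. By analyzing a user's speech, respiratory rate could be monitored in real time to flag potential health problems early. The technique does not require the user to wear any sensor, so it is easy to use and has broad potential. Future work could study respiratory rate estimation under different conditions, such as noisy environments or varying speaking rates.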

📄 Abstract (original)

The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and is modulated by the vocal tract, where such actions are interspersed by moments of breathing in air (inhalation) to refill the lungs again. Respiratory rate (RR) is a vital metric that is used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measure RR (number of breaths one takes in a minute) are performed using specialized equipment or training. Studies have demonstrated that machine learning algorithms can be used to estimate RR using bio-sensor signals as input. Speech-based estimation of RR can offer an effective approach to measure the vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning based approach to estimate RR from speech segments obtained from subjects speaking to a close-talking microphone device. Data were collected from N=26 individuals, where the groundtruth RR was obtained through commercial grade chest-belts and then manually corrected for any errors. A convolutional long-short term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that the use of pre-trained representations obtained from a foundation model, such as Wav2Vec2, can be used to estimate respiration-time-series with low root-mean-squared error and high correlation coefficient, when compared with the baseline. The model-driven time series can be used to estimate $RR$ with a low mean absolute error (MAE) ~ 1.6 breaths/min.
