Audio-Aware Large Language Models as Judges for Speaking Styles

Authors: Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Categories: eess.AS, cs.AI, cs.CL

Published: 2025-06-06

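💡 One-Sentence Summary

Proposes using audio-aware large language models (ALLMs) as judges of speaking style.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: audio awareness, large language models, speaking style evaluation, spoken language models, multimodal evaluation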

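📋 Key Points

Existing spoken language models (SLMs) fall short in controlling speaking style and generating natural dialogue, which limits their use across diverse applications.
The paper proposes audio-aware large language models (ALLMs) as automatic judges that assess several speaking-style attributes of speech together.
Experiments show that Gemini's agreement with human judges is comparable to the agreement among human evaluators, indicating that ALLMs are promising judges for SLMs.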

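📝 Abstract (Translated from Chinese)

Audio-aware large language models (ALLMs) can understand both the textual and non-textual information in audio input. This paper explores using ALLMs as automatic judges of speaking style. ALLM judges evaluate speech generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking styles considered include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. A comparison with human evaluators shows that the agreement between Gemini and human judges is comparable to the agreement among human evaluators. These results indicate that ALLMs can serve as judges for evaluating SLMs, and they also reveal that current SLMs still have room for improvement in controlling speaking style and generating natural dialogue.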

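🔬 Method Details

Problem definition: The paper addresses the difficulty of evaluating how well spoken language models (SLMs) control speaking style across varied attributes such as emotion and speaking pace.

Core idea: Use audio-aware large language models (ALLMs) as automatic judges of speaking style, with the aim of making evaluation more accurate and efficient.

Technical framework: Four SLMs generate speech for the two tasks (voice style instruction following and role-playing). ALLM judges then assess the speaking style of the responses, and human evaluators provide the basis for comparison.

Key innovation: The work applies ALLMs to speaking-style evaluation, offering a new way to evaluate SLMs. For Gemini, agreement with human judges is comparable to the agreement among human evaluators.

Key design: The evaluation uses multiple criteria covering different aspects of speaking style, including emotion, volume, and speaking pace, so that the assessment is broad rather than limited to a single attribute.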
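The framework above can be sketched as a small evaluation loop. This is a minimal illustration only, not the authors' implementation: the `Sample` fields, the `Judge` callable, the `dummy_judge` stand-in, and the 1-to-5 score scale are all assumptions introduced here, since the summary does not describe the prompts, rubric, or scoring scale. Only the list of style dimensions comes from the abstract.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Speaking-style dimensions named in the abstract.
STYLE_DIMENSIONS = [
    "emotion", "volume", "speaking pace",
    "word emphasis", "pitch control", "non-verbal elements",
]

@dataclass
class Sample:
    """One SLM response to be judged (hypothetical record layout)."""
    slm_name: str      # which spoken language model produced the audio
    task: str          # "style_instruction_following" or "role_playing"
    instruction: str   # the style instruction or role-play prompt
    audio_path: str    # where the generated speech is stored

# A judge maps a sample to a score. In practice this would wrap a call to
# an ALLM or a human rating; the 1-5 scale here is an assumption.
Judge = Callable[[Sample], int]

def evaluate(samples: List[Sample], judge: Judge) -> Dict[str, float]:
    """Average the judge's scores per SLM."""
    scores: Dict[str, List[int]] = {}
    for sample in samples:
        scores.setdefault(sample.slm_name, []).append(judge(sample))
    return {name: mean(vals) for name, vals in scores.items()}

def dummy_judge(sample: Sample) -> int:
    """Placeholder judge so the sketch runs without any model access."""
    return 3 if sample.task == "role_playing" else 4
```

Swapping `dummy_judge` for a function that sends the audio and instruction to an ALLM, or for a lookup of human ratings, leaves `evaluate` unchanged, which is what makes the judge and human scores directly comparable.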

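📊 Experimental Highlights

The experiments show that the agreement between Gemini and human judges is comparable to the agreement among human evaluators, which supports using ALLMs to evaluate SLM-generated speech. The results also show that current SLMs still have room for improvement in controlling speaking style, pointing to a direction for future research.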
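The comparison between judge-human agreement and human-human agreement can be illustrated as follows. The summary does not say which agreement statistic the paper uses, so Pearson correlation is an assumption here, and the scores below are made-up illustrative values, not results from the paper.

```python
import math
from typing import Sequence

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Illustrative (fabricated for the example) scores on the same six samples.
human_a = [5, 3, 4, 2, 1, 4]
human_b = [4, 3, 5, 2, 1, 3]
allm_judge = [5, 2, 4, 2, 1, 4]

human_human = pearson(human_a, human_b)
judge_human = pearson(allm_judge, human_a)
print(f"human-human agreement: {human_human:.3f}")
print(f"judge-human agreement: {judge_human:.3f}")
```

The claim in the text corresponds to the judge-human value being in the same range as the human-human value, rather than to either value being close to 1.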

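🎯 Application Scenarios

Potential applications include education, public-speaking training, and voice assistants, where an automatic judge could give speakers real-time feedback to help them improve their delivery. As the technology matures, ALLMs may see broader use in multimodal evaluation and contribute to progress in human-computer interaction.

📄 Abstract (Original)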

Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as an automatic judge to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by SLMs on two tasks: voice style instruction following and role-playing. The speaking style we consider includes emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, with human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as a judge to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
