VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

📄 arXiv: 2509.22651v1

Authors: Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, Hongsheng Li

Categories: cs.CL, cs.AI, cs.CV, cs.HC, cs.SD

Published: 2025-09-26

🔗 Code/Project: https://mathllm.github.io/VoiceAssistantEval/


💡 One-Sentence Takeaway

Proposes VoiceAssistant-Eval, a benchmark for evaluating the multimodal capabilities of AI assistants across listening, speaking, and viewing.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal evaluation, AI assistants, speech recognition, natural language processing, open-source models, listening/speaking/viewing capabilities, task taxonomy

📋 Key Points

  1. Existing benchmarks cannot comprehensively evaluate AI assistants across listening, speaking, and viewing, leaving significant gaps.
  2. This paper proposes the VoiceAssistant-Eval benchmark, which spans a wide range of tasks and is designed to comprehensively evaluate these multimodal capabilities.
  3. Experiments show that open-source models outperform proprietary ones on some tasks, and that a well-designed mid-sized model can surpass a much larger one in listening accuracy.

📝 Abstract (translated)

As the capabilities of large language models and multimodal systems grow, voice-first AI assistants have attracted broad attention. However, existing benchmarks cannot comprehensively evaluate these systems. To address this, the paper introduces VoiceAssistant-Eval, a comprehensive benchmark for assessing AI assistants across listening, speaking, and viewing. The benchmark comprises 10,497 curated examples spanning 13 task categories, covering natural sounds, music, spoken dialogue, and more. Evaluating 21 open-source models and GPT-4o-Audio, the authors find that most models excel at speaking tasks but lag in audio understanding. The paper establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants.

🔬 Method Details

Problem definition: The paper targets the inability of existing benchmarks to comprehensively evaluate the multimodal capabilities of AI assistants, particularly across listening, speaking, and viewing.

Core idea: Build the VoiceAssistant-Eval benchmark to span diverse tasks and scenarios, providing a comprehensive evaluation framework for understanding and improving AI assistant performance.

Technical framework: VoiceAssistant-Eval comprises 10,497 examples organized into three modules (listening, speaking, and viewing), covering natural sounds, spoken dialogue, role-play, and other tasks.
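To make the three-module organization concrete, a benchmark of this shape could be stored as tagged records grouped by module and task category. The field names below are illustrative assumptions, not the paper's released data schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for one benchmark example;
# the actual VoiceAssistant-Eval data format may differ.
@dataclass
class EvalExample:
    example_id: str
    module: str                        # "listening", "speaking", or "viewing"
    task_category: str                 # one of the 13 task categories
    prompt: str                        # text/instruction component
    audio_path: Optional[str] = None   # input audio, if any
    image_path: Optional[str] = None   # input image, if any
    reference: str = ""                # reference answer used for scoring

def count_by_module(examples):
    """Tally examples per module, mirroring the benchmark's three-way split."""
    counts = {}
    for ex in examples:
        counts[ex.module] = counts.get(ex.module, 0) + 1
    return counts

examples = [
    EvalExample("ex-001", "listening", "natural_sounds", "What sound is this?",
                audio_path="a.wav", reference="rain"),
    EvalExample("ex-002", "speaking", "role_play", "Imitate a pirate greeting."),
    EvalExample("ex-003", "viewing", "image_qa", "Describe the image.",
                image_path="b.png", reference="a red bicycle"),
]
print(count_by_module(examples))  # {'listening': 1, 'speaking': 1, 'viewing': 1}
```

A flat record format like this makes per-category accuracy breakdowns (as reported in the paper's experiments) a simple group-by over `task_category`.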

Key innovation: The benchmark's breadth and diversity allow it to assess an AI assistant's listening, speaking, and viewing abilities simultaneously, filling a gap left by existing evaluation tools.

Key design: Task-category selection and example curation are central, ensuring coverage of diverse real-world scenarios; evaluation jointly measures response content, speech quality, and their consistency.
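One simple way to combine such per-axis measurements into a single score is a weighted average over the three evaluation dimensions. This is only an illustrative sketch; the weights (and the aggregation scheme itself) are assumptions, not the paper's scoring formula:

```python
def combine_scores(content: float, speech: float, consistency: float,
                   weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted average of three sub-scores in [0, 1].

    The three axes follow the benchmark's evaluation dimensions
    (response content, speech quality, content-speech consistency);
    the weights here are illustrative, not taken from the paper.
    """
    scores = (content, speech, consistency)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("sub-scores must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))

print(combine_scores(0.8, 0.6, 0.9))  # roughly 0.76 with these example scores
```

Keeping the sub-scores separate until the final aggregation is what lets a benchmark report findings like "strong at speaking, weak at audio understanding" rather than a single opaque number.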


📊 Experimental Highlights

Results show that open-source models outperform proprietary ones on some tasks; notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. At the same time, current models still struggle with multimodal (audio plus visual) input and role-play voice imitation, leaving clear room for improvement.

🎯 Application Scenarios

Potential application areas for VoiceAssistant-Eval include smart homes, customer-service bots, and educational assistants. By evaluating AI assistants comprehensively, the benchmark helps developers pinpoint model weaknesses, driving the improvement of next-generation AI assistants and better user experiences.

📄 Abstract (original)

The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/ .