VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing

📄 arXiv: 2509.22651v1

Authors: Ke Wang, Houxing Ren, Zimu Lu, Mingjie Zhan, Hongsheng Li

Categories: cs.CL, cs.AI, cs.CV, cs.HC, cs.SD

Published: 2025-09-26

🔗 Code/Project: https://mathllm.github.io/VoiceAssistantEval/


💡 One-Sentence Takeaway

Proposes VoiceAssistant-Eval, a benchmark for evaluating the multimodal capabilities of AI assistants across listening, speaking, and viewing.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal evaluation, AI assistants, speech recognition, natural language processing, open-source models, listening/speaking/viewing capabilities, task taxonomy

📋 Key Points

  1. Existing benchmarks cannot comprehensively evaluate AI assistants across listening, speaking, and viewing, leaving significant gaps.
  2. This paper proposes the VoiceAssistant-Eval benchmark, which spans a wide range of tasks and is designed to comprehensively evaluate these multimodal capabilities.
  3. Experiments show that open-source models outperform proprietary ones on some tasks, and that a well-designed mid-sized model can surpass a much larger one in listening accuracy.

📝 Abstract (translated)

As the capabilities of large language models and multimodal systems grow, voice-first AI assistants have attracted broad attention. However, existing benchmarks cannot comprehensively evaluate these systems. To address this, the paper introduces VoiceAssistant-Eval, a comprehensive benchmark for assessing AI assistants across listening, speaking, and viewing. The benchmark comprises 10,497 curated examples spanning 13 task categories, covering natural sounds, music, spoken dialogue, and more. Evaluating 21 open-source models and GPT-4o-Audio, the authors find that most models excel at speaking tasks but lag in audio understanding. The paper establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants.

🔬 Method Details

Problem definition: The paper targets the inability of existing benchmarks to comprehensively evaluate the multimodal capabilities of AI assistants, particularly across listening, speaking, and viewing.

Core idea: Build the VoiceAssistant-Eval benchmark to span diverse tasks and scenarios, providing a comprehensive evaluation framework for understanding and improving AI assistant performance.

Technical framework: VoiceAssistant-Eval comprises 10,497 examples organized into three modules (listening, speaking, and viewing), covering natural sounds, spoken dialogue, role-play, and other tasks.
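To make the three-module organization concrete, a benchmark of this shape could be stored as tagged records grouped by module and task category. The field names below are illustrative assumptions, not the paper's released data schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for one benchmark example;
# the actual VoiceAssistant-Eval data format may differ.
@dataclass
class EvalExample:
    example_id: str
    module: str                        # "listening", "speaking", or "viewing"
    task_category: str                 # one of the 13 task categories
    prompt: str                        # text/instruction component
    audio_path: Optional[str] = None   # input audio, if any
    image_path: Optional[str] = None   # input image, if any
    reference: str = ""                # reference answer used for scoring

def count_by_module(examples):
    """Tally examples per module, mirroring the benchmark's three-way split."""
    counts = {}
    for ex in examples:
        counts[ex.module] = counts.get(ex.module, 0) + 1
    return counts

examples = [
    EvalExample("ex-001", "listening", "natural_sounds", "What sound is this?",
                audio_path="a.wav", reference="rain"),
    EvalExample("ex-002", "speaking", "role_play", "Imitate a pirate greeting."),
    EvalExample("ex-003", "viewing", "image_qa", "Describe the image.",
                image_path="b.png", reference="a red bicycle"),
]
print(count_by_module(examples))  # {'listening': 1, 'speaking': 1, 'viewing': 1}
```

A flat record format like this makes per-category accuracy breakdowns (as reported in the paper's experiments) a simple group-by over `task_category`.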

Key innovation: The benchmark's breadth and diversity allow it to assess an AI assistant's listening, speaking, and viewing abilities simultaneously, filling a gap left by existing evaluation tools.

Key design: Task-category selection and example curation are central, ensuring coverage of diverse real-world scenarios; evaluation jointly measures response content, speech quality, and their consistency.
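One simple way to combine such per-axis measurements into a single score is a weighted average over the three evaluation dimensions. This is only an illustrative sketch; the weights (and the aggregation scheme itself) are assumptions, not the paper's scoring formula:

```python
def combine_scores(content: float, speech: float, consistency: float,
                   weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted average of three sub-scores in [0, 1].

    The three axes follow the benchmark's evaluation dimensions
    (response content, speech quality, content-speech consistency);
    the weights here are illustrative, not taken from the paper.
    """
    scores = (content, speech, consistency)
    if not all(0.0 <= s <= 1.0 for s in scores):
        raise ValueError("sub-scores must lie in [0, 1]")
    return sum(w * s for w, s in zip(weights, scores))

print(combine_scores(0.8, 0.6, 0.9))  # roughly 0.76 with these example scores
```

Keeping the sub-scores separate until the final aggregation is what lets a benchmark report findings like "strong at speaking, weak at audio understanding" rather than a single opaque number.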


📊 Experimental Highlights

Results show that open-source models outperform proprietary ones on some tasks; notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. At the same time, current models still struggle with multimodal (audio plus visual) input and role-play voice imitation, leaving clear room for improvement.

🎯 Application Scenarios

Potential application areas for VoiceAssistant-Eval include smart homes, customer-service bots, and educational assistants. By evaluating AI assistants comprehensively, the benchmark helps developers pinpoint model weaknesses, driving the improvement of next-generation AI assistants and better user experiences.

📄 Abstract (original)

The growing capabilities of large language models and multimodal systems have spurred interest in voice-first AI assistants, yet existing benchmarks are inadequate for evaluating the full range of these systems' capabilities. We introduce VoiceAssistant-Eval, a comprehensive benchmark designed to assess AI assistants across listening, speaking, and viewing. VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. These tasks include natural sounds, music, and spoken dialogue for listening; multi-turn dialogue, role-play imitation, and various scenarios for speaking; and highly heterogeneous images for viewing. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio, measuring the quality of the response content and speech, as well as their consistency. The results reveal three key findings: (1) proprietary models do not universally outperform open-source models; (2) most models excel at speaking tasks but lag in audio understanding; and (3) well-designed smaller models can rival much larger ones. Notably, the mid-sized Step-Audio-2-mini (7B) achieves more than double the listening accuracy of LLaMA-Omni2-32B-Bilingual. However, challenges remain: multimodal (audio plus visual) input and role-play voice imitation tasks are difficult for current models, and significant gaps persist in robustness and safety alignment. VoiceAssistant-Eval identifies these gaps and establishes a rigorous framework for evaluating and guiding the development of next-generation AI assistants. Code and data will be released at https://mathllm.github.io/VoiceAssistantEval/ .