MMAE: A Massive Multitask Audio Editing Benchmark
Authors: Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen
Categories: cs.SD, cs.CL, cs.MM
Published: 2026-06-05
Notes: Open-source at https://github.com/ddlBoJack/MMAE
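💡 One-Sentence Takeaway
Proposes the MMAE benchmark to address the lack of comprehensive evaluation for instruction-based audio editing.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: audio editing, multitask learning, evaluation benchmark, intelligent creation, audio processing, complexity taxonomy, human-agent collaboration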
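📋 Key Points
- Existing audio editing evaluation is highly fragmented and lacks a comprehensive testbed, so it cannot properly assess multitask audio editing.
- MMAE covers 7 audio modalities and 6 levels of task complexity, giving a single evaluation framework for a wide range of audio editing tasks.
- Experiments show that current audio editing systems are far from reliable: the Exact Match Rate (EMR) stays below 5% and drops to 0% on complex, mixed-modality tasks, exposing key bottlenecks.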
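📝 Abstract (translated from the Chinese summary)
We introduce MMAE, a Massive Multitask Audio Editing benchmark and the first comprehensive evaluation testbed for general-purpose instruction-based audio editing. With the rise of intelligent creation, interactive editing has rapidly expanded from visual domains into audio. However, the existing evaluation infrastructure lags severely and is highly fragmented, restricted to specific subdomains or basic operations. MMAE covers 7 distinct audio modalities, establishes a taxonomy with 6 levels of task complexity, and comprises 2,000 high-fidelity samples curated through human-agent collaboration, paired with a rubric-based evaluation framework. Our evaluation shows that current systems still face major bottlenecks in executing precise edits: the Exact Match Rate (EMR) consistently falls below 5% and drops to 0% on complex, mixed-modality tasks. We hope MMAE will provide a clear diagnostic roadmap for future progress in the intelligent creation community.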
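🔬 Method Details
Problem definition: The paper targets the shortcomings of existing audio editing evaluation, in particular the absence of a comprehensive, systematic benchmark, which makes it impossible to assess multitask audio editing performance properly.
Core idea: MMAE builds an evaluation framework that spans multiple audio modalities and complexity levels, giving a systematic way to evaluate audio editing systems.
Technical framework: MMAE is organized along 7 audio modalities (sound, speech, music, and their mixtures), 6 levels of task complexity, 2 levels of granularity, and 8 operation types. Its 2,000 high-fidelity samples are decomposed into 17,741 verifiable criteria, forming a multi-dimensional evaluation scheme.
Key innovation: The main novelty is the rubric-based evaluation framework, which allows precise assessment of both instruction following and context consistency. The authors describe this paradigm as pioneering.
Key design: Samples are curated through human-agent collaboration and scored against rubrics, and the tasks range from basic modifications to multi-hop reasoning and multi-round editing.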
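The counts in the taxonomy above can be sanity-checked with a few lines of Python. This is only an illustrative sketch. It assumes that the 7 modalities are the three base types named in the abstract (sound, speech, music) plus every mixture of them, which is one reading consistent with the stated count but is not confirmed by the source. The variable names are hypothetical.

```python
from itertools import combinations

# Base audio types named in the abstract.
BASE_TYPES = ("sound", "speech", "music")

# ASSUMPTION: a "modality" is any non-empty combination of the base types.
# This yields 2**3 - 1 = 7 entries, matching the count reported for MMAE.
modalities = [
    "+".join(combo)
    for size in range(1, len(BASE_TYPES) + 1)
    for combo in combinations(BASE_TYPES, size)
]

# Figures reported in the abstract.
NUM_SAMPLES = 2_000
NUM_CRITERIA = 17_741

# Average number of verifiable criteria attached to each sample.
avg_criteria = NUM_CRITERIA / NUM_SAMPLES

print(len(modalities), modalities)
print(f"{avg_criteria:.2f} criteria per sample on average")
```

On average, each sample is therefore checked against roughly nine separate criteria, which gives some intuition for why satisfying all of them at once is hard.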
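📊 Experimental Highlights
Across the leading models evaluated, the Exact Match Rate (EMR) consistently falls below 5% and drops to 0% on complex, mixed-modality tasks. These results expose major bottlenecks in precise execution and structural robustness and point to directions for future research.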
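To make the rubric-based scoring concrete, here is a minimal sketch of how per-criterion verdicts could be aggregated. The source does not give the exact definition of EMR. The code assumes that a sample counts as an exact match only when every one of its criteria passes, and all class and function names are hypothetical rather than taken from the MMAE codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    # Dimension names follow the two aspects named in the abstract.
    dimension: str  # "instruction_following" or "context_consistency"
    passed: bool


def pass_rate(results: list[CriterionResult]) -> float:
    """Fraction of criteria that passed for one sample."""
    if not results:
        raise ValueError("a sample needs at least one criterion")
    return sum(r.passed for r in results) / len(results)


def is_exact_match(results: list[CriterionResult]) -> bool:
    """ASSUMPTION: exact match means every criterion of the sample passed."""
    return bool(results) and all(r.passed for r in results)


def exact_match_rate(samples: list[list[CriterionResult]]) -> float:
    """Fraction of samples whose criteria all passed."""
    if not samples:
        raise ValueError("need at least one sample")
    return sum(is_exact_match(s) for s in samples) / len(samples)


# Toy data: the first sample passes everything, the second fails one criterion.
samples = [
    [
        CriterionResult("instruction_following", True),
        CriterionResult("context_consistency", True),
    ],
    [
        CriterionResult("instruction_following", True),
        CriterionResult("instruction_following", False),
        CriterionResult("context_consistency", True),
    ],
]

print(exact_match_rate(samples))  # 0.5
print(pass_rate(samples[1]))      # two of three criteria passed
```

Under this reading, a system can satisfy most individual criteria and still score a low EMR, because a single failed criterion disqualifies the whole sample.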
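🎯 Application Scenarios
Potential applications of the MMAE benchmark include intelligent audio editing software, audio content creation platforms, and education and training tools. Its standardized evaluation framework can help developers diagnose and improve audio editing algorithms and support further progress in audio processing.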
📄 Abstract (original)
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.