Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Authors: Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, Eng Siong Chng, Xie Chen

Categories: cs.CL, cs.CV, cs.MM, cs.SD

Published: 2025-10-14

Notes: https://github.com/ddlBoJack/Omni-Captioner

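💡 One-sentence summary

Proposes Omni-Captioner for fine-grained omni-modal perception, together with a matching data pipeline, models, and evaluation benchmark.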

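🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, fine-grained perception, data generation, hallucination suppression, captioning, audio-visual understanding, Omni-Captioner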

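📋 Key points

Existing Omni Language Models fall short at capturing and describing fine-grained details, and they are prone to hallucination.
Omni-Detective, a data generation pipeline, uses tool calling to autonomously produce high-quality, low-hallucination multimodal data.
Audio-Captioner and Omni-Captioner are trained on this data and achieve leading or competitive performance on several benchmarks.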

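📝 Abstract (translated from Chinese)

Fine-grained perception of multimodal information is essential for improving human-AI interaction. As audio-visual technology has advanced, Omni Language Models (OLMs), which process audio and video signals in parallel, have become a promising paradigm for richer understanding and reasoning. Their ability to capture and describe fine-grained details, however, has not been studied in depth. This paper presents a systematic study of omni detailed perception from three angles: the data pipeline, the models, and the benchmark. The authors first observe an inherent "co-growth" between detail and hallucination in current OLMs. To address it, they propose Omni-Detective, an agentic data generation pipeline that integrates tool calling to autonomously produce highly detailed yet minimally hallucinated multimodal data. On data generated by Omni-Detective they train two captioning models: Audio-Captioner for audio-only detailed perception and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance among all open-source models on MMAU and MMAR, surpassing Gemini 2.5 Flash and performing comparably to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state of the art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Because no benchmark is dedicated to omni detailed perception, the authors design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captions that aims for stable, efficient, and reliable assessment. Experimental results and analysis show that Omni-Detective is effective at generating high-quality detailed captions and that Omni-Cloze is better suited to evaluating such captions.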

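🔬 Method details

Problem definition: The paper addresses the limited ability of existing language models to perceive and describe fine-grained details in omni-modal (audio and video) settings, and the hallucination that accompanies it. Existing methods either fail to capture enough detail or introduce inaccurate or fabricated information when they generate detailed descriptions.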

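Core idea: Build Omni-Detective, a data generation pipeline that produces high-quality, low-hallucination multimodal data, and train dedicated captioning models on its output. Higher-quality data is meant to improve the models' perception of detail while reducing hallucination.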

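Technical framework: The framework has three main stages: data generation, model training, and evaluation. Data generation uses Omni-Detective, a pipeline that integrates tool calling so that it can autonomously gather information from various sources and produce detailed captions. Model training produces two models, Audio-Captioner (audio only) and Omni-Captioner (audio-visual). Evaluation uses existing benchmarks together with the newly proposed Omni-Cloze evaluation.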

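Key innovation: The most important technical contribution is the Omni-Detective data generation pipeline. Its tool-calling mechanism makes data generation more automated and controllable, so it can produce high-quality, low-hallucination multimodal data. Compared with manual annotation or simple automatic generation, Omni-Detective strikes a better balance between richness of detail and factual accuracy.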
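The abstract does not describe how Omni-Detective calls its tools, so the following is only a minimal sketch of one way a tool-checked captioning loop could look. Everything in it is hypothetical: the `investigate` function, the `DetectiveState` record, and the stub tools `fake_asr` and `fake_ocr` are illustrative names, not the authors' implementation. The sketch keeps a candidate detail only when a tool returns supporting evidence, which is one simple way to trade a little detail for fewer hallucinations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical tool signature: takes a claim to check and returns an
# observation string; an empty string means "no supporting evidence".
Tool = Callable[[str], str]


@dataclass
class DetectiveState:
    """Evidence gathered for one clip (illustrative, not from the paper)."""
    verified: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)


def investigate(candidates: Dict[str, str], tools: Dict[str, Tool]) -> DetectiveState:
    """Check each candidate detail with its assigned tool.

    `candidates` maps a detail (a claim about the clip) to the name of the
    tool that should verify it. Unsupported details are dropped.
    """
    state = DetectiveState()
    for detail, tool_name in candidates.items():
        tool = tools.get(tool_name)
        observation = tool(detail) if tool else ""
        if observation:
            state.verified.append(f"{detail} ({tool_name}: {observation})")
        else:
            # No evidence: drop the detail rather than risk a hallucination.
            state.rejected.append(detail)
    return state


# Stub tools standing in for real speech-recognition and OCR models.
def fake_asr(claim: str) -> str:
    transcript = "welcome back to the show"
    return transcript if "welcome" in claim.lower() else ""


def fake_ocr(claim: str) -> str:
    return ""  # pretend no on-screen text was detected


state = investigate(
    {
        "The host says 'welcome back'": "asr",
        "A banner reads 'LIVE'": "ocr",
    },
    {"asr": fake_asr, "ocr": fake_ocr},
)
print(state.verified)
print(state.rejected)
```

In this toy run the spoken-greeting detail survives because the stub transcript supports it, while the banner detail is rejected because the stub OCR tool reports nothing. A real pipeline would presumably let a model decide which tool to call and when to stop; that decision logic is not specified in the source.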

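Key design: The implementation details of Omni-Detective are not given in the abstract. From the description, the key questions are how the tool-calling strategy is designed and how the information returned by the tools is used to generate high-quality captions. For Omni-Cloze, the key design question is how to construct cloze-style items that effectively measure a model's perception of detail. Model architecture, loss functions, and similar specifics are not mentioned in the abstract and remain unknown here.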
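To make the idea of cloze-style caption evaluation concrete, here is a small sketch under stated assumptions. The item format, the `NOT_GIVEN` fallback option, and the substring-matching answerer are all hypothetical choices made for illustration; the source does not say how Omni-Cloze builds or scores its items, and a real setup would presumably use a language model rather than string matching to fill the blanks.

```python
from dataclasses import dataclass
from typing import List

NOT_GIVEN = "Not given"  # hypothetical option for details the caption omits


@dataclass
class ClozeItem:
    passage: str        # reference sentence with one blank, written as "____"
    options: List[str]  # candidate fillers for the blank
    answer: str         # the correct filler


def answer_from_caption(caption: str, item: ClozeItem) -> str:
    """Toy answerer: pick the first option that literally appears in the caption.

    Substring matching is only a stand-in for a model that reads the caption.
    """
    text = caption.lower()
    for option in item.options:
        if option.lower() in text:
            return option
    return NOT_GIVEN


def cloze_accuracy(caption: str, items: List[ClozeItem]) -> float:
    """Fraction of blanks that can be filled correctly from the caption alone."""
    if not items:
        return 0.0
    correct = sum(answer_from_caption(caption, it) == it.answer for it in items)
    return correct / len(items)


caption = "A man in a red jacket plays acoustic guitar while rain falls outside."
items = [
    ClozeItem("The man wears a ____ jacket.", ["blue", "red", "green"], "red"),
    ClozeItem("He plays ____ guitar.", ["electric", "acoustic", "bass"], "acoustic"),
    ClozeItem("A ____ barks in the background.", ["dog", "fox", "seal"], "dog"),
]
score = cloze_accuracy(caption, items)
print(f"{score:.2f}")
```

The caption above supports two of the three blanks, so it scores 2/3. The third item shows how a missing detail lowers the score without any free-form text comparison, which is the appeal of a cloze format: each blank has a single checkable answer.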

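📊 Experimental highlights

Audio-Captioner surpasses Gemini 2.5 Flash on MMAU and MMAR and reaches the level of Gemini 2.5 Pro. Omni-Captioner achieves state-of-the-art results on VDC and the best balance between detail and hallucination on video-SALMONN 2. The Omni-Cloze evaluation is reported to assess fine-grained captions stably, efficiently, and reliably.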
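The abstract reports the MMAU and MMAR results under a "cascade evaluation protocol" without defining it. A common reading, assumed here, is a two-stage setup: the captioner first turns the audio into text, then a text-only model answers each benchmark question from the caption alone. The sketch below follows that assumption; `cascade_accuracy`, the stub captioner, and the stub answerer are hypothetical names for illustration, not the paper's code or any benchmark's API.

```python
from typing import Callable, List, Tuple

Captioner = Callable[[str], str]                      # media id -> caption
TextAnswerer = Callable[[str, str, List[str]], str]   # (caption, question, choices) -> choice
Sample = Tuple[str, str, List[str], str]              # (media id, question, choices, gold)


def cascade_accuracy(samples: List[Sample], captioner: Captioner,
                     answerer: TextAnswerer) -> float:
    """Two-stage evaluation: caption the media, then answer from the caption only."""
    if not samples:
        return 0.0
    correct = 0
    for media, question, choices, gold in samples:
        caption = captioner(media)                    # stage 1: perception
        pred = answerer(caption, question, choices)   # stage 2: text-only reasoning
        correct += int(pred == gold)
    return correct / len(samples)


# Stubs standing in for a real captioning model and a real text-only LLM.
STUB_CAPTIONS = {
    "clip_a.wav": "A dog barks twice, then a door slams.",
    "clip_b.wav": "Soft piano music with light rain.",
}


def stub_captioner(media: str) -> str:
    return STUB_CAPTIONS[media]


def stub_answerer(caption: str, question: str, choices: List[str]) -> str:
    # Pick the first choice mentioned in the caption; otherwise guess the first choice.
    text = caption.lower()
    for choice in choices:
        if choice.lower() in text:
            return choice
    return choices[0]


samples = [
    ("clip_a.wav", "Which animal is heard?", ["cat", "dog"], "dog"),
    ("clip_b.wav", "Which instrument is playing?", ["violin", "piano"], "piano"),
    ("clip_b.wav", "Is the recording indoors or outdoors?", ["indoors", "outdoors"], "outdoors"),
]
accuracy = cascade_accuracy(samples, stub_captioner, stub_answerer)
print(f"{accuracy:.2f}")
```

In this toy run the third question fails because the caption never says where the recording was made. Under a cascade setup, any detail the caption omits is invisible to the answering model, so question-answering accuracy acts as an indirect measure of caption detail.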

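🎯 Application scenarios

The results could be applied to intelligent assistants, video content understanding, and driver assistance. An assistant could understand spoken instructions and the surrounding environment more accurately and so offer more personalized service. For video content understanding, more detailed descriptions could be generated automatically, making videos easier to search and understand. In driver assistance, more accurate perception of surrounding traffic could improve driving safety.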

📄 Abstract (original)

Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent "co-growth" between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 test set. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.
