Evian: Towards Explainable Visual Instruction-tuning Data Auditing

📄 arXiv: 2604.20544v1

Authors: Zimu Jia, Mingjie Xu, Andrew Estornell, Jiaheng Wei

Categories: cs.CV, cs.AI

Published: 2026-04-22

Comments: Accepted at ACL 2026


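💡 One-sentence takeaway

Proposes the EVIAN framework for auditing visual instruction-tuning data

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: vision-language models, data auditing, logical coherence, factual accuracy, data quality, deep learning, model fine-tuning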

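📋 Key points

  1. Existing datasets are inconsistent in quality, and current data filtering methods cannot identify fine-grained flaws such as logical fallacies and factual errors, which has become a bottleneck for model development.
  2. The paper proposes the EVIAN framework, which decomposes model responses into visual description, subjective inference, and factual claim so that each can be analyzed in a targeted way.
  3. Experiments show that a model fine-tuned on a small, high-quality dataset curated by EVIAN outperforms models trained on much larger datasets.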

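📝 Summary

The effectiveness of Large Vision-Language Models (LVLMs) depends heavily on the quality of their training data. Existing datasets are inconsistent in quality, and current data filtering methods cannot identify fine-grained semantic flaws. To address this, the paper makes three core contributions: it constructs a 300K-sample benchmark for data auditing by injecting diverse defects; it introduces a "Decomposition-then-Evaluation" paradigm that analyzes the cognitive components of model responses; and it implements the EVIAN framework, which automatically evaluates Image-Text Consistency, Logical Coherence, and Factual Accuracy. Experiments show that a model fine-tuned on a small, high-quality subset curated by EVIAN outperforms models trained on much larger datasets.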
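The benchmark is described as being built by injecting defects into samples. The sketch below only illustrates that general idea with a plain string substitution. The `AuditSample` type, the `inject_defect` helper, and the `"factual_error"` label are all hypothetical names chosen for this example; the source does not say how the authors actually generate or label defects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditSample:
    """A response plus the defect label an auditor should recover (hypothetical type)."""
    response: str
    defect_type: Optional[str]  # None marks a clean, unmodified sample


def inject_defect(response: str, defect_type: str,
                  original: str, corrupted: str) -> AuditSample:
    """Replace the first occurrence of `original` with `corrupted` and record the label.

    This is an assumed, deliberately simple injection strategy for illustration.
    """
    if original not in response:
        raise ValueError(f"{original!r} does not occur in the response")
    return AuditSample(response.replace(original, corrupted, 1), defect_type)


clean = AuditSample("The Eiffel Tower is in Paris.", None)
flawed = inject_defect(clean.response, "factual_error", "Paris", "Berlin")
```

Keeping the injected label next to the modified text is what makes such a set usable as a testbed: an auditing method can be scored on whether it flags `flawed` and leaves `clean` alone.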

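🔬 Method details

Problem definition: The paper addresses the inconsistent quality of training data for large vision-language models. Existing methods rely on coarse-grained scores and cannot identify subtle semantic flaws such as logical fallacies and factual errors.

Core idea: The paper proposes a "Decomposition-then-Evaluation" paradigm that breaks a model response into several cognitive components so that each can be analyzed and audited at a finer granularity.

Technical framework: EVIAN consists of three main modules: dataset construction, response decomposition, and evaluation. They handle defect injection, cognitive-component analysis, and quality assessment, respectively.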
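The decompose-then-evaluate flow can be sketched in a few lines of Python. The three component names and three axis names come from the abstract; everything else is an assumption made for illustration: the one-to-one pairing in `AXIS_FOR`, the keyword-based `toy_classify` (a stand-in for whatever classifier the authors use), the sentence splitting on periods, and the worst-case (minimum) aggregation in `evaluate`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class Component(Enum):
    VISUAL_DESCRIPTION = "visual_description"
    SUBJECTIVE_INFERENCE = "subjective_inference"
    FACTUAL_CLAIM = "factual_claim"


class Axis(Enum):
    IMAGE_TEXT_CONSISTENCY = "image_text_consistency"
    LOGICAL_COHERENCE = "logical_coherence"
    FACTUAL_ACCURACY = "factual_accuracy"


# Assumed pairing of components with axes; the source does not state this mapping.
AXIS_FOR = {
    Component.VISUAL_DESCRIPTION: Axis.IMAGE_TEXT_CONSISTENCY,
    Component.SUBJECTIVE_INFERENCE: Axis.LOGICAL_COHERENCE,
    Component.FACTUAL_CLAIM: Axis.FACTUAL_ACCURACY,
}


@dataclass
class Segment:
    text: str
    component: Component


def toy_classify(sentence: str) -> Component:
    """Keyword heuristic standing in for a real component classifier (hypothetical)."""
    lowered = sentence.lower()
    if any(cue in lowered for cue in ("probably", "suggests", "seems")):
        return Component.SUBJECTIVE_INFERENCE
    if any(cue in lowered for cue in ("the image shows", "in the picture")):
        return Component.VISUAL_DESCRIPTION
    return Component.FACTUAL_CLAIM


def decompose(response: str,
              classify: Callable[[str], Component] = toy_classify) -> List[Segment]:
    """Split a response into sentences and tag each with a cognitive component."""
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    return [Segment(s, classify(s)) for s in sentences]


def evaluate(segments: List[Segment],
             score: Callable[[Segment], float]) -> Dict[Axis, float]:
    """Score each segment on its paired axis; keep the worst score per axis.

    Axes with no matching segment are omitted from the result.
    """
    result: Dict[Axis, float] = {}
    for seg in segments:
        axis = AXIS_FOR[seg.component]
        value = score(seg)
        result[axis] = min(result[axis], value) if axis in result else value
    return result


segments = decompose(
    "The image shows a dog on a beach. It is probably summer. Dogs are mammals."
)
```

In a real pipeline the `score` callable would be backed by a model or verifier per axis. Taking the minimum per axis is one possible design choice: a single incoherent inference then lowers the Logical Coherence score for the whole response, which keeps the per-axis result explainable by pointing at the offending segment.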

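Key innovation: The core novelty of EVIAN is its decompose-then-evaluate analysis, which assesses responses along Image-Text Consistency, Logical Coherence, and Factual Accuracy, and offers higher auditing precision than traditional coarse-grained methods.

Key design: The framework uses targeted loss functions and evaluation metrics so that the assessment of each cognitive component reflects the actual quality of the data.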

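📊 Experimental highlights

Models fine-tuned on the small, high-quality dataset curated by EVIAN significantly outperform models trained on much larger datasets, with the gains showing up in logical coherence and factual accuracy. The results also confirm that Logical Coherence plays a key role in data quality evaluation.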
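One simple way to turn per-axis audit scores into a curated subset is a threshold filter, sketched below. The source does not describe EVIAN's actual selection rule, so the `curate` function, the `"scores"` field layout, and the threshold values are all hypothetical.

```python
from typing import Dict, List

# Axis names follow the abstract; their use as dictionary keys is an assumption.
AXES = ("image_text_consistency", "logical_coherence", "factual_accuracy")


def curate(samples: List[Dict], thresholds: Dict[str, float]) -> List[Dict]:
    """Keep only samples whose score meets the threshold on every axis."""
    kept = []
    for sample in samples:
        scores = sample["scores"]
        if all(scores[axis] >= thresholds[axis] for axis in AXES):
            kept.append(sample)
    return kept


# Made-up scores, purely to exercise the filter.
samples = [
    {"id": "a", "scores": {"image_text_consistency": 0.9,
                           "logical_coherence": 0.8,
                           "factual_accuracy": 0.95}},
    {"id": "b", "scores": {"image_text_consistency": 0.9,
                           "logical_coherence": 0.3,
                           "factual_accuracy": 0.95}},
]
thresholds = {"image_text_consistency": 0.7,
              "logical_coherence": 0.7,
              "factual_accuracy": 0.7}
subset = curate(samples, thresholds)
```

Because each axis has its own threshold, a rejected sample can be traced to the specific axis it failed, here Logical Coherence for sample `b`.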

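🎯 Application scenarios

Potential applications include training and optimizing vision-language models, auditing dataset quality, and improving the reliability of AI systems. EVIAN can give data scientists and researchers a more effective data auditing tool, supporting the development of higher-quality AI models.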

📄 Abstract (original)

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.