Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy

📄 arXiv: 2503.20673v2

Authors: Yinan Sun, Xiongkuo Min, Zicheng Zhang, Yixuan Gao, Yuqin Cao, Guangtao Zhai

Category: cs.CV

Published: 2025-03-26 (updated: 2025-03-27)


💡 One-Line Summary

Proposes the SAFEQA model and the ESA-PO framework to mitigate hallucinations of multimodal large models on low-level vision tasks

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal large models · low-level vision · hallucination mitigation · self-awareness · image quality assessment

📋 Key Points

  1. Multimodal large models are prone to hallucinations on low-level vision tasks, a problem that existing research has paid little attention to, especially in image quality assessment.
  2. The paper proposes the SAFEQA model and the ESA-PO framework, which exploit image and quality features and strengthen the model's awareness of its knowledge boundaries to reduce hallucinations.
  3. Experiments show that the method significantly improves the model's self-awareness on low-level vision tasks, reduces hallucinations, and outperforms closed-source models on multiple metrics.

📝 Abstract (Translated)

Multimodal large language models have made remarkable progress in visual perception and understanding, but they are prone to hallucinations, which limits their reliability. This paper focuses on hallucinations in Low-level Visual Perception and Understanding (HLPU), particularly in image quality assessment tasks. The authors argue that these hallucinations stem from the models' lack of clear self-awareness. To address this, they first construct the HLPU instruction database, the first instruction database dedicated to hallucinations in low-level vision tasks, containing approximately 200K question-answer pairs. They then propose the Self-Awareness Failure Elimination (SAFEQA) model, which uses image features, salient-region features, and quality features to improve the model's perception and comprehension abilities on low-level vision tasks. In addition, they propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to strengthen the model's awareness of its knowledge boundaries and thereby reduce hallucinations. Experimental results show that the method significantly improves the model's self-awareness on low-level vision tasks and reduces hallucinations.

🔬 Method Details

Problem definition: The paper targets the hallucinations produced by multimodal large language models on low-level visual perception and understanding (HLPU) tasks. Existing work concentrates on natural language processing and image captioning, leaving hallucinations in HLPU tasks under-explored. In tasks such as image quality assessment in particular, models readily make judgments that contradict the image content, undermining their reliability.

Core idea: Reduce hallucinations by strengthening the model's self-awareness. Concretely, the model should know the boundaries of its own knowledge and avoid making unwarranted inferences when it is uncertain. To this end, the paper designs the SAFEQA model and the ESA-PO framework, which improve self-awareness from the model-architecture side and the training-strategy side, respectively.
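The summary does not describe how "knowing the knowledge boundary" is operationalized; as a toy illustration only (the function name, threshold, and option strings are invented here, not from the paper), the behavior can be pictured as abstaining instead of guessing when no answer option is confident enough:

```python
def answer_or_abstain(option_probs, threshold=0.5):
    """If the model's best option falls below the confidence threshold,
    abstain instead of guessing -- a toy proxy for a model that knows
    the boundary of its own knowledge."""
    best = max(option_probs, key=option_probs.get)
    if option_probs[best] < threshold:
        return "I am not sure."
    return best

# Confident case: answer; uncertain case: abstain rather than hallucinate.
print(answer_or_abstain({"Excellent": 0.9, "Poor": 0.1}))                 # Excellent
print(answer_or_abstain({"Excellent": 0.4, "Poor": 0.35, "Fair": 0.25}))  # I am not sure.
```

The paper's actual mechanism is learned through training (ESA-PO) rather than a hard threshold; this sketch only conveys the intended input-output behavior.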

Technical framework: The overall framework has two main parts: the SAFEQA model and the ESA-PO training framework. SAFEQA fuses image features, salient-region features, and quality features in a multimodal fashion to improve the model's understanding of image content. ESA-PO then uses preference optimization to teach the model to distinguish what it knows from what it does not, strengthening its self-awareness.
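The exact fusion architecture and feature dimensions are not given in this summary; a minimal NumPy sketch of the general pattern (project each of the three feature branches into a shared embedding space, then concatenate the resulting tokens for the language model to attend over) might look as follows, with all dimensions and the single global quality token being assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper does not publish these numbers.
D_IMG, D_SAL, D_QUAL, D_LLM = 64, 32, 16, 128
N_TOKENS = 4  # visual tokens per spatial branch (illustrative)

# One projection matrix per branch, standing in for learned projectors.
w_img = rng.standard_normal((D_IMG, D_LLM)) * 0.02
w_sal = rng.standard_normal((D_SAL, D_LLM)) * 0.02
w_qual = rng.standard_normal((D_QUAL, D_LLM)) * 0.02

def fuse(img_feat, sal_feat, qual_feat):
    """Token-level fusion: project each branch into the LLM embedding
    space, then concatenate along the token axis so the language model
    attends over image, salient-region, and quality streams jointly."""
    tokens = [img_feat @ w_img, sal_feat @ w_sal, qual_feat @ w_qual]
    return np.concatenate(tokens, axis=0)

img = rng.standard_normal((N_TOKENS, D_IMG))
sal = rng.standard_normal((N_TOKENS, D_SAL))
qual = rng.standard_normal((1, D_QUAL))  # one global quality token

fused = fuse(img, sal, qual)
print(fused.shape)  # (2 * N_TOKENS + 1, D_LLM) = (9, 128)
```

Concatenation along the token axis (rather than summing features) keeps the three streams separately attendable, which is a common design when the branches carry qualitatively different information.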

Key innovation: The central novelty is bringing self-awareness into hallucination mitigation for low-level vision tasks. The HLPU instruction database supplies the training data from which the model learns self-awareness, while the SAFEQA model and ESA-PO framework let the model better exploit image and quality features and distinguish known from unknown, reducing hallucinations.

Key design: SAFEQA's key design is multimodal feature fusion, effectively integrating image features, salient-region features, and quality features. ESA-PO's key design is preference optimization: a suitably designed reward guides the model to distinguish known from unknown. The concrete parameter settings and network-architecture details are described in the paper; no specific values are reproduced here.
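ESA-PO's exact objective is not specified in this summary; assuming it resembles a standard direct-preference-optimization (DPO) style loss (the function, log-probability values, and `beta` below are illustrative stand-ins, not the paper's formulation), a preference pair would pit an honest refusal on an out-of-boundary question against a confident hallucination:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss: push the policy to rank the preferred answer above
    the rejected one, measured relative to a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# For a question beyond the model's knowledge, the honest refusal is the
# chosen answer and the confident hallucination is the rejected one.
loss_good = dpo_loss(logp_chosen=-5.0, logp_rejected=-20.0,
                     ref_logp_chosen=-10.0, ref_logp_rejected=-10.0)
loss_bad = dpo_loss(logp_chosen=-20.0, logp_rejected=-5.0,
                    ref_logp_chosen=-10.0, ref_logp_rejected=-10.0)
print(loss_good < loss_bad)  # ranking the refusal higher yields lower loss
```

Under such an objective, the model is rewarded for assigning higher likelihood to boundary-aware answers than the reference model does, which is one plausible way to train the "known vs. unknown" distinction the paper describes.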

🖼️ Key Figures

fig_0
fig_1
fig_2

📊 Experimental Highlights

Experimental results show that the proposed SAFEQA model and ESA-PO framework markedly improve the model's self-awareness on low-level vision tasks and effectively reduce hallucinations. On multiple image quality assessment datasets, the method outperforms existing baselines, with significant gains in both accuracy and self-awareness. Concrete performance numbers and improvement margins are reported in the paper.

🎯 Applications

The results can be applied to image quality assessment, image restoration, image enhancement, and related areas. Reducing hallucinations on low-level vision tasks improves the reliability and safety of AI systems. The approach may further generalize to broader vision tasks, such as autonomous driving and medical image analysis.

📄 Abstract (Original)

The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms closed-source models in terms of various evaluation metrics.