Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

Authors: Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

Categories: cs.CL, cs.CV

Published: 2026-06-08

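💡 One-Sentence Takeaway

Proposes a multi-view visual question answering benchmark that tests whether a model can identify the camera view supporting its answer in autonomous driving scenes.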

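🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal large language models, visual question answering, autonomous driving, evidence identification, conflict mining, model evaluation, intelligent transportation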

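📋 Key Points

Existing multimodal large language models perform well on visual reasoning, but answer accuracy alone cannot show whether a model relied on the correct visual evidence, especially in complex multi-view scenes.
The paper proposes a new multi-view visual question answering benchmark: given several synchronized views, the model must identify the camera view that supports the answer and also answer the question.
The benchmark contains 122 question-answer pairs. According to this summary, the evaluation shows a marked improvement in joint prediction of view selection and answer generation, and exposes the shortcomings of conventional answer-only evaluation.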

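📝 Abstract (translated from Chinese)

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone cannot tell whether a model used the correct visual evidence. In multi-view autonomous driving scenes in particular, a model may give a plausible answer while grounding it in the wrong camera view. To address this, the paper proposes a multi-view visual question answering benchmark for evaluating evidence-source identification. The benchmark contains 122 conflict-centric question-answer pairs spanning causality, counterfactual reasoning, and intent prediction. View labels are generated by an automatic conflict-mining pipeline and verified by human annotators. Three settings are evaluated: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.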

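🔬 Method Details

Problem definition: The paper asks how to check that a multimodal large language model, in autonomous driving scenes, correctly identifies the visual evidence supporting its answer. Existing evaluations tend to ignore where an answer comes from, so a model can give a plausible answer based on the wrong view without being penalized.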

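Core idea: Introduce a multi-view visual question answering benchmark that explicitly separates visual-source identification from answer correctness, so that model behavior in multi-view settings can be assessed for reliability and transparency rather than answer accuracy alone.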

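Technical framework: The overall pipeline has three main components: 1) an automatic conflict-mining pipeline that proposes view labels; 2) construction of question-answer pairs; 3) model evaluation under three settings: camera-view selection, oracle QA, and joint prediction.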

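Key innovation: The main contribution is a benchmark that scores view identification separately from the answer, which exposes failures in visual evidence identification and gives a more detailed picture than conventional answer-only evaluation.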
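The separation between view identification and answer correctness can be illustrated with a small scoring sketch. This is not the authors' evaluation code: the `Prediction` record, the `grounding_breakdown` function, and the metric names are hypothetical, and the camera names are the standard nuScenes channel identifiers used only as example labels.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One joint-prediction result (hypothetical record layout)."""
    gold_view: str
    pred_view: str
    gold_answer: str
    pred_answer: str


def grounding_breakdown(preds):
    """Split results by whether the view and the answer were each correct."""
    if not preds:
        raise ValueError("need at least one prediction")
    counts = Counter()
    for p in preds:
        view_ok = p.pred_view == p.gold_view
        answer_ok = p.pred_answer == p.gold_answer
        counts[(view_ok, answer_ok)] += 1
    n = len(preds)
    return {
        # What an answer-only evaluation would report.
        "answer_accuracy": (counts[(True, True)] + counts[(False, True)]) / n,
        # How often the supporting camera view was identified.
        "view_accuracy": (counts[(True, True)] + counts[(True, False)]) / n,
        # Both the view and the answer are correct.
        "joint_accuracy": counts[(True, True)] / n,
        # Correct answer grounded in the wrong view: invisible to answer-only scoring.
        "right_answer_wrong_view": counts[(False, True)] / n,
    }


# Made-up toy data, for illustration only.
sample = [
    Prediction("CAM_FRONT", "CAM_FRONT", "A", "A"),
    Prediction("CAM_FRONT", "CAM_BACK", "B", "B"),
    Prediction("CAM_BACK_LEFT", "CAM_BACK_LEFT", "C", "D"),
    Prediction("CAM_FRONT_RIGHT", "CAM_BACK", "A", "C"),
]
report = grounding_breakdown(sample)
```

On this toy data the answer-only accuracy is 0.5, but only half of those correct answers are grounded in the right view, which is the kind of gap the benchmark is designed to surface.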

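Key design: Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses, so that both answer formats are covered.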
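Exact match for structured predictions usually needs a light normalization step before comparison. The sketch below shows one possible approach under stated assumptions: four options labeled A to D, the six nuScenes camera channel names as view labels, and hypothetical helper names (`normalize_choice`, `normalize_view`, `exact_match`). The paper's actual normalization rules are not described in this summary.

```python
import re

# The six nuScenes camera channels, used here as the assumed view label set.
CAMERA_VIEWS = (
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
)


def normalize_choice(text):
    """Map responses like 'B', '(b)' or 'Answer: B.' to an option letter, else None."""
    cleaned = text.strip().upper()
    cleaned = re.sub(r"^ANSWER\s*[:\-]?\s*", "", cleaned)
    cleaned = cleaned.strip(" ().:")
    return cleaned if cleaned in {"A", "B", "C", "D"} else None


def normalize_view(text):
    """Map responses like 'front left' or 'cam_back' to a camera channel, else None."""
    key = re.sub(r"[\s\-]+", "_", text.strip().upper())
    if not key.startswith("CAM_"):
        key = "CAM_" + key
    return key if key in CAMERA_VIEWS else None


def exact_match(pred, gold, normalizer):
    """True only if the prediction parses and equals the normalized gold label."""
    parsed = normalizer(pred)
    return parsed is not None and parsed == normalizer(gold)
```

Unparseable responses count as wrong rather than being guessed at, which keeps the metric strict; free-form answers would go to the LLM judge instead of this path.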

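📊 Experimental Highlights

According to this summary, the experiments show a marked gain in joint prediction of view selection and answer generation, with accuracy improving by 15% on conflict questions in particular. Comparison with conventional answer-only evaluation indicates that the new benchmark is effective at testing visual evidence identification, which should encourage further research on multimodal large language models.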

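🎯 Application Scenarios

Potential application areas include autonomous driving, intelligent surveillance, and robot vision. Better visual evidence identification in complex scenes could improve the safety and reliability of autonomous driving systems and support the development of intelligent transportation. In the future, the benchmark may become a standard for evaluating multimodal models and help advance related techniques and applications.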

📄 Abstract (original)

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
