Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Authors: Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan
Categories: cs.CL, cs.AI
Published: 2025-09-28
备注: Our code (https://github.com/DELTA-DoubleWise/OmniReason) and data (https://huggingface.co/datasets/ycwang11/OmniReason) are publicly available
💡 One-sentence takeaway
A logic-grounded evaluation framework that pinpoints the foundational bottlenecks in multimodal reasoning
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: multimodal reasoning, logic-grounded evaluation, task-composition bottleneck, fusion bottleneck, attention mechanisms, reasoning paths, model optimization
📋 Key points
- Existing reports on how modality interactions affect multimodal reasoning are contradictory, and controlled evaluation frameworks have been lacking.
- The paper proposes a logic-grounded evaluation framework that categorizes six interaction patterns to analyze when modalities support or harm reasoning.
- Experiments show that an added modality improves performance only when it provides an independent and sufficient reasoning path; otherwise it degrades overall performance.
📝 Abstract (translated)
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating inputs such as text, vision, and audio. However, cross-modal reasoning remains underexplored, and reports conflict on whether additional modalities help or harm performance. Through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, this paper finds empirically that reasoning improves only when an added modality provides an independent and sufficient reasoning path. The study further identifies a task-composition bottleneck and a fusion bottleneck, showing that integration, not perception, is the main barrier to multimodal reasoning.
🔬 Method details
Problem definition: The paper targets the task-composition and fusion bottlenecks in multimodal reasoning. Prior work reports contradictory effects of modality interactions and lacks a controlled evaluation framework, so it remains unclear when interactions support or undermine reasoning.
Core idea: The paper builds a logic-grounded evaluation framework that divides multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined, and analyzes when modalities support or harm reasoning to isolate the key bottlenecks.
Technical framework: The analysis proceeds in two stages: first identifying whether each modality supplies an independent reasoning path, then examining how modality signals are fused. A two-step prompting scheme (recognize, then reason) restores performance lost in single-pass inference.
Key innovation: The central contribution is identifying the task-composition bottleneck and the fusion bottleneck, establishing that integration, not perception, is the primary obstacle to multimodal reasoning. Unlike prior work, the paper focuses on how modalities interact and how they are fused.
Key design: Analysis of model internals shows that attention patterns fail to encode the usefulness of facts; softening attention in early fusion layers is proposed to improve reasoning.
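The two-step "recognize, then reason" scheme can be sketched as two separate prompts, so perception and reasoning are never demanded in a single pass. This is a minimal illustration, not the authors' exact templates: `query_model` and the prompt wording are hypothetical stand-ins for any MLLM API.

```python
# Hedged sketch of two-step (recognize-then-reason) prompting.
# `query_model` is a hypothetical callable wrapping an MLLM API call;
# the prompt templates below are illustrative, not the paper's exact wording.

def recognition_prompt(question: str) -> str:
    """Step 1: ask the model only to extract facts from each modality."""
    return (
        "List the facts stated in the text and the facts shown in the image, "
        "one per line. Do not answer the question yet.\n"
        f"Question (for context only): {question}"
    )

def reasoning_prompt(question: str, recognized_facts: str) -> str:
    """Step 2: reason over the already-extracted facts as plain text."""
    return (
        "Using only the facts below, answer the question step by step.\n"
        f"Facts:\n{recognized_facts}\n"
        f"Question: {question}"
    )

def two_step_answer(question: str, query_model) -> str:
    facts = query_model(recognition_prompt(question))      # pass 1: perception only
    return query_model(reasoning_prompt(question, facts))  # pass 2: reasoning only
```

Splitting the task this way targets the reported task-composition bottleneck: each call asks the model to do only one of recognition or reasoning.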
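One common way to soften an attention distribution is temperature scaling of the attention logits; the sketch below illustrates that idea (the paper's exact intervention in early fusion layers may differ, so treat this as an assumption-laden illustration).

```python
# Minimal sketch of attention softening via temperature scaling.
# tau > 1 flattens the softmax, reducing how strongly any single
# modality's tokens dominate in early fusion layers.
import numpy as np

def soften_attention(logits: np.ndarray, tau: float = 2.0) -> np.ndarray:
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)
```

With `tau=1.0` this is an ordinary softmax; raising `tau` moves the weights toward uniform, which is the sense in which early-fusion attention is "softened".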
📊 Experimental highlights
Under the logic-grounded evaluation framework, an additional modality significantly improves reasoning when it provides an independent reasoning path, and degrades overall performance otherwise. The two-step prompting method (recognize, then reason) restores the model's reasoning ability, confirming the task-composition bottleneck.
🎯 Applications
Potential application areas include intelligent assistants, autonomous driving, medical image analysis, and other multimodal tasks. Stronger multimodal reasoning can improve decision quality and user experience, and the findings may inform the design of a broad range of intelligent systems.
📄 Abstract (original)
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.