Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Author: Jayadev Billa

Categories: cs.CL, cs.AI, cs.LG

Published: 2026-02-28

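💡 One-Sentence Summary

Frames modality collapse in multimodal LLMs as a mismatched-decoder problem and shows that the training objective determines which information becomes accessible.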

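🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal LLM, decoder mismatch, generalized mutual information, emotion analysis, information extraction, machine learning, artificial intelligence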

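📋 Key Points

Existing multimodal LLMs process speech and images but cannot access attributes such as a speaker's voice or an object's texture; the paper attributes this to a mismatched decoder rather than to an encoding failure.
The paper shows that removing 64-71% of modality-specific variance improves decoder loss, and it analyzes the limit with a theoretical framework based on Generalized Mutual Information (GMI).
In a LoRA experiment, training with an emotion objective raises emotion accessibility by 7.5% without affecting other attributes, which confirms that the training objective determines what becomes accessible.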

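📝 Abstract (Summary)

Multimodal large language models (LLMs) can process speech and images, yet they cannot effectively extract a speaker's voice or an object's texture. The paper shows that this is not an encoding failure: speaker identity, emotion, and visual attributes remain linearly decodable through every LLM layer, but the decoder has no learned use for them. Removing 64-71% of modality-specific variance actually improves decoder loss, which indicates that the decoder treats these directions as noise. The paper formalizes the phenomenon as a mismatched decoder problem, bounds the accessible information by the Generalized Mutual Information (GMI), and validates the theory on five models spanning speech and vision. The experiments show that the training objective determines which information is accessible: adding an emotion objective improves emotion accessibility.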

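🔬 Method Details

Problem definition: The paper addresses modality collapse in multimodal LLMs, where information that the encoder preserves is not extracted by the decoder. A decoder trained on text can only extract information along text-aligned directions, so modality-specific information goes unused.

Core idea: The failure is located in the decoder, not the encoder. Removing modality-specific variance improves decoder loss, which shows that the decoder has no learned use for those directions. The Generalized Mutual Information (GMI) framework is used to bound what a mismatched decoder can access.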
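
The variance-removal idea can be illustrated with a small sketch. This page does not say how the paper isolates modality-specific variance, so the code below makes a loud assumption: a set of text-aligned directions is already known (the hypothetical `text_basis`), and everything orthogonal to it counts as modality-specific. The function name and the synthetic features are illustrative only.

```python
import numpy as np

def project_onto_subspace(H, basis):
    """Keep only the part of H that lies in span(basis).

    H     : (n, d) hidden states, one row per token or frame
    basis : (k, d) directions assumed to be text-aligned (hypothetical)
    Returns the projected features and the fraction of variance removed.
    """
    # Orthonormalise the basis so that Q @ Q.T is an orthogonal projector.
    Q, _ = np.linalg.qr(basis.T)                 # Q has shape (d, k)
    mean = H.mean(axis=0, keepdims=True)
    centered = H - mean
    kept = centered @ Q @ Q.T                    # component inside the subspace
    removed_fraction = 1.0 - np.sum(kept ** 2) / np.sum(centered ** 2)
    return kept + mean, float(removed_fraction)

# Synthetic stand-in for real hidden states (assumption: isotropic features).
rng = np.random.default_rng(0)
d, k, n = 8, 3, 2000
text_basis = np.eye(d)[:k]                       # hypothetical text-aligned directions
H = rng.normal(size=(n, d))
H_clean, removed = project_onto_subspace(H, text_basis)
print(f"fraction of variance removed: {removed:.3f}")
```

With real models, the projected features would be fed to the decoder in place of the originals and the loss compared before and after; that step is omitted here because it depends on the model under study.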

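Technical framework: The analysis separates the pipeline into an encoder and a decoder. The encoder processes the multimodal input; what the decoder can extract from the resulting representation depends on its training objective.

Key innovation: The paper formalizes modality collapse as a mismatched decoder problem and bounds accessible information by the GMI, with degradation that scales with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of a particular architecture: it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all.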
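
To make the GMI concrete, here is a toy calculation using the standard definition from the mismatched-decoding literature: the supremum over s >= 0 of E[log(q(X,Y)^s / E_X'[q(X',Y)^s])], where q is the decoder's scoring rule and X' is an independent copy of the input. The paper's exact formulation may differ. The 2-bit channel, the two metrics, and the grid search over s are all assumptions made for illustration.

```python
import math

def gmi_bits(p_x, channel, metric, s_grid):
    """Generalized Mutual Information in bits for a discrete channel.

    p_x[x]        : input distribution
    channel[x][y] : true channel W(y|x)
    metric[x][y]  : decoder scoring rule q(x, y), strictly positive
    The supremum over s is approximated by a maximum over s_grid.
    """
    best = 0.0                                   # the value at s = 0
    n_x, n_y = len(p_x), len(channel[0])
    for s in s_grid:
        total = 0.0
        for y in range(n_y):
            denom = sum(p_x[xp] * metric[xp][y] ** s for xp in range(n_x))
            for x in range(n_x):
                joint = p_x[x] * channel[x][y]
                if joint > 0.0:
                    total += joint * math.log2(metric[x][y] ** s / denom)
        best = max(best, total)
    return best

# Toy setup: x encodes two bits; the high bit plays the role of "text-aligned"
# content, the low bit plays the role of a modality-specific attribute.
n = 4
eps = 1e-3
p_x = [1.0 / n] * n
noiseless = [[1.0 if x == y else 0.0 for y in range(n)] for x in range(n)]
matched = [[1.0 if x == y else eps for y in range(n)] for x in range(n)]
# The mismatched rule only scores agreement on the high bit.
mismatched = [[1.0 if (x >> 1) == (y >> 1) else eps for y in range(n)]
              for x in range(n)]
s_grid = [0.5 * i for i in range(1, 41)]

gmi_matched = gmi_bits(p_x, noiseless, matched, s_grid)
gmi_mismatched = gmi_bits(p_x, noiseless, mismatched, s_grid)
print(gmi_matched, gmi_mismatched)
```

In this toy case the channel carries 2 bits without loss. The matched rule recovers about 2 bits, while the rule that ignores the low bit saturates near 1 bit: the information is present in the output but the scoring rule cannot use it.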

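Key design: The experiments vary the training objective. A LoRA intervention that adds an emotion objective improves emotion accessibility (+7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible. Specific hyperparameters and the loss-function design are not disclosed in detail.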
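
The original abstract describes the emotion-objective experiment as a LoRA intervention. The sketch below shows only the generic LoRA mechanism (a frozen weight plus a scaled low-rank update); the rank, scaling, target layers, and emotion loss used in the paper are not given on this page, so every number and name here is a placeholder.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update (generic LoRA)."""

    def __init__(self, W, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                        # frozen, shape (d_out, d_in)
        self.A = rng.normal(scale=0.01, size=(rank, W.shape[1]))
        self.B = np.zeros((W.shape[0], rank))             # zero init: no change at start
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus the low-rank path; only A and B would be trained.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # Equivalent dense weight after training, useful for inference.
        return self.W + self.scale * self.B @ self.A

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 16))                              # placeholder decoder weight
layer = LoRALinear(W, rank=2, alpha=4.0)
x = rng.normal(size=(3, 16))

base_out = x @ W.T
init_out = layer(x)                                       # identical to base_out at init
layer.B = rng.normal(size=layer.B.shape)                  # stand-in for a training update
tuned_out = layer(x)
```

Because the base weight stays frozen, an adapter trained on one objective changes the decoder only through the low-rank term, which is consistent with the reported result that other attributes were unaffected.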

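📊 Experimental Highlights

The experiments show that removing 64-71% of modality-specific variance improves decoder loss, and that adding an emotion objective raises emotion accessibility by 7.5%. A controlled comparison of two Prismatic VLMs that differ only in encoder text-alignment confirms that the bottleneck is the decoder's scoring rule, not the encoder or the projection.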

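🎯 Application Scenarios

Potential applications include multimodal information retrieval, emotion analysis, and human-computer interaction. Better decoding in multimodal LLMs would allow complex multimodal data to be understood and processed more accurately, which could benefit intelligent assistants, machine translation, and content generation.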

📄 Abstract (Original)

Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.
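
The "above chance in linear probes" measurement can be sketched as follows. The probe here is a ridge-regularised least-squares classifier on one-hot targets, chosen for brevity; the paper's probe may be a different linear model. The features are synthetic stand-ins for per-layer hidden states, and all sizes are arbitrary assumptions.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, ridge=1e-3):
    """Fit a linear map from features to one-hot labels by ridge regression."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])          # append a bias column
    Y = np.eye(n_classes)[y]
    return np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ Y)

def probe_accuracy(W, X, y):
    """Accuracy of the probe when predicting the highest-scoring class."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return float(np.mean((A @ W).argmax(axis=1) == y))

# Synthetic "hidden states": each class (e.g. a speaker) has its own mean vector.
rng = np.random.default_rng(0)
n_classes, dim = 10, 32
class_means = rng.normal(scale=2.0, size=(n_classes, dim))
y_train = rng.integers(0, n_classes, size=1000)
y_test = rng.integers(0, n_classes, size=500)
X_train = class_means[y_train] + rng.normal(size=(1000, dim))
X_test = class_means[y_test] + rng.normal(size=(500, dim))

W = fit_linear_probe(X_train, y_train, n_classes)
accuracy = probe_accuracy(W, X_test, y_test)
chance = 1.0 / n_classes
ratio = accuracy / chance                                 # "times above chance"
print(f"accuracy={accuracy:.3f}, ratio over chance={ratio:.1f}")
```

Running the same probe on the hidden states of each layer would give a per-layer ratio; a ratio that stays well above 1 in late layers indicates that the attribute is still encoded, even if the decoder never expresses it.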
