MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

📄 arXiv: 2605.10616v1

Authors: Alan Arazi, Eilam Shapira, Shoham Grunblat, Mor Ventura, Elad Hoffer, Gioia Blayer, David Holzmüller, Lennart Purucker, Gaël Varoquaux, Frank Hutter, Roi Reichart

Categories: cs.LG, cs.CL, cs.CV

Published: 2026-05-11


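💡 One-Sentence Summary

Introduces MulTaBench, a benchmark that addresses the poor task alignment of unstructured-data representations in multimodal tabular learning.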

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, tabular data, representation learning, benchmarking, foundation models, feature alignment

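📋 Key Points

  1. Existing tabular models process unstructured data through frozen, pretrained embeddings. This cannot fully exploit the complementary information across modalities, which limits predictive performance.
  2. The paper introduces MulTaBench, a benchmark of 40 curated datasets that highlights the importance of target-aware representation tuning in multimodal tabular tasks.
  3. Experiments show that tuning the embeddings to the task improves performance across several architectures and both modalities, laying the groundwork for new multimodal tabular foundation models.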

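📝 Abstract (Summary)

Tabular foundation models have reached the state of the art in supervised tabular learning through pretraining, but they lack native support for unstructured modalities such as text and image and typically rely on frozen, pretrained embeddings. On established multimodal tabular learning benchmarks, the authors show that tuning these embeddings to the task improves performance. Existing benchmarks, however, mostly focus on the mere co-occurrence of modalities, which leads to high variance across datasets and masks the benefit of task-specific tuning. To address this, the paper introduces MulTaBench, a benchmark of 40 datasets split equally between image-tabular and text-tabular tasks. The benchmark focuses on settings where the modalities provide complementary predictive signal, which makes Target-Aware Representations necessary. Experiments show that the gains from this tuning strategy generalize across modalities, tabular learners, encoder scales, and embedding dimensions. MulTaBench is the largest image-tabular benchmark to date and is intended to advance research on multimodal tabular foundation models.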

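🔬 Method Details

Problem definition: When handling unstructured modalities such as text and image, existing tabular learning methods usually apply a frozen, pretrained feature extractor directly. This ignores the semantic gap between generic embeddings and the specific prediction task, so the model fails to capture complementary information that matters for the downstream task.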

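Core idea: The paper introduces the concept of Target-Aware Representations. It argues that in multimodal tabular learning the embeddings must be allowed to adapt to the downstream task, so that features from the unstructured modalities are better aligned with the tabular data and the prediction target.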
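
The contrast between generic and target-aware representations can be illustrated with a toy linear sketch. This is not the paper's method: the "encoder output" `emb` is synthetic, PCA stands in for a task-agnostic embedding reduction, and a least-squares projection stands in for task-aware tuning. All names and constants are assumptions chosen so that the predictive signal sits in a low-variance embedding direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 400, 16, 2

# Synthetic stand-in for frozen encoder outputs: two high-variance
# directions carry no signal, the target depends on a low-variance one.
scales = np.array([10.0, 8.0] + [0.5] * (d - 2))
emb = rng.normal(size=(n, d)) * scales
tab = rng.normal(size=(n, 3))  # synthetic tabular features
y = tab[:, 0] + 4.0 * emb[:, 5] + 0.1 * rng.normal(size=n)

tr, te = slice(0, 300), slice(300, n)

def fit_predict(x_tr, y_tr, x_te):
    """Ordinary least squares with an intercept, used as the tabular learner."""
    a_tr = np.c_[x_tr, np.ones(len(x_tr))]
    coef, *_ = np.linalg.lstsq(a_tr, y_tr, rcond=None)
    return np.c_[x_te, np.ones(len(x_te))] @ coef

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

centered = emb - emb[tr].mean(axis=0)

# Task-agnostic reduction: keep the top-k principal components (fit on train).
_, _, vt = np.linalg.svd(centered[tr], full_matrices=False)
z_generic = centered @ vt[:k].T

# Target-aware reduction: one direction fitted against the training target.
w, *_ = np.linalg.lstsq(centered[tr], y[tr] - y[tr].mean(), rcond=None)
z_aware = (centered @ w)[:, None]

r2_generic = r2(y[te], fit_predict(np.c_[tab[tr], z_generic[tr]], y[tr],
                                   np.c_[tab[te], z_generic[te]]))
r2_aware = r2(y[te], fit_predict(np.c_[tab[tr], z_aware[tr]], y[tr],
                                 np.c_[tab[te], z_aware[te]]))
print(f"generic R^2: {r2_generic:.3f}  target-aware R^2: {r2_aware:.3f}")
```

In this construction the generic reduction keeps the directions with the most variance and discards the one the target depends on, while the supervised projection recovers it. Real encoders are nonlinear and tuned by gradient descent, so the sketch only conveys the intuition.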

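Technical framework: MulTaBench comprises 40 curated datasets in two groups, image-tabular and text-tabular. The core procedure evaluates different tabular learners (e.g., TabTransformer, FT-Transformer) with trainable embeddings and compares performance with frozen embeddings against performance with tuned embeddings.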
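
A frozen-versus-tuned comparison of this kind can be summarized per modality roughly as follows. This is a minimal sketch: the `Result` record, the `summarize` helper, and the dataset and learner names are hypothetical, and the scores are placeholders for illustration, not results from the paper.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Result:
    dataset: str
    modality: str  # "image" or "text"
    learner: str
    frozen: float  # score with frozen embeddings (higher is better)
    tuned: float   # score with task-tuned embeddings

def summarize(results):
    """Per modality, report the mean gain and win rate of tuned over frozen."""
    summary = {}
    for modality in sorted({r.modality for r in results}):
        gains = [r.tuned - r.frozen for r in results if r.modality == modality]
        summary[modality] = {
            "mean_gain": mean(gains),
            "win_rate": sum(g > 0 for g in gains) / len(gains),
        }
    return summary

# Placeholder scores for illustration only; not taken from the paper.
results = [
    Result("ds_a", "image", "learner_1", 0.70, 0.75),
    Result("ds_b", "image", "learner_1", 0.60, 0.58),
    Result("ds_c", "text", "learner_1", 0.80, 0.84),
    Result("ds_d", "text", "learner_2", 0.50, 0.56),
]
summary = summarize(results)
for modality, stats in summary.items():
    print(modality, stats)
```

Reporting paired per-dataset gains and a win rate, rather than only averaged raw scores, keeps large differences in absolute score between datasets from dominating the comparison.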

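Key innovation: The study systematically shows that existing benchmarks, by treating the mere co-occurrence of modalities as sufficient, mask the gains from tuning. It builds a large, high-quality benchmark and establishes task alignment as central to multimodal tabular learning.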

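Key design: The design centers on complementarity between modalities. Extensive experiments across encoder scales, embedding dimensions, and several tabular learners check that the gains from target-aware tuning are robust and generalize, and the benchmark offers a common evaluation standard for future model architectures.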
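
The experimental axes named above can be laid out as a configuration grid in which every frozen run has a tuned counterpart. This is a sketch under assumptions: the axis values and learner names are hypothetical, since the concrete settings are not given in this summary.

```python
from itertools import product

# Hypothetical axis values; the actual settings are not stated here.
modalities = ["image", "text"]
embedding_modes = ["frozen", "tuned"]
encoder_scales = ["small", "large"]
embedding_dims = [64, 256]
learners = ["learner_a", "learner_b"]

grid = [
    {
        "modality": m,
        "embeddings": e,
        "encoder_scale": s,
        "embedding_dim": dim,
        "learner": lr,
    }
    for m, e, s, dim, lr in product(
        modalities, embedding_modes, encoder_scales, embedding_dims, learners
    )
]

def setting_key(cfg):
    """Identify a setting by every axis except the embedding mode."""
    return tuple(v for name, v in cfg.items() if name != "embeddings")

# Pair each frozen configuration with its tuned counterpart.
pairs = {}
for cfg in grid:
    pairs.setdefault(setting_key(cfg), {})[cfg["embeddings"]] = cfg

print(len(grid), "runs in", len(pairs), "frozen/tuned pairs")
```

Pairing the runs this way means each reported gain compares two configurations that differ only in whether the embeddings were tuned.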

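📊 Experimental Highlights

MulTaBench is the largest image-tabular benchmarking effort to date. The experiments show that target-aware tuning of the embeddings yields performance gains across several tabular learners and across both text and image modalities. The study quantifies these gains and shows that they hold across encoder scales and embedding dimensions. By focusing on tasks where the modalities carry complementary signal, the benchmark reduces the high cross-dataset variance that masked the benefit of tuning in earlier benchmarks.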

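🎯 Application Scenarios

The work is relevant to domains such as healthcare (for example, combining electronic health record text with medical imaging) and e-commerce (for example, combining product description text with product images for sales prediction). Better fusion of unstructured and tabular data can strengthen decision support in complex business settings and help bring multimodal foundation models into industrial use.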

📄 Abstract (Original)

Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.