MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

Authors: Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

Categories: cs.CV, cs.AI

Published: 2026-05-11

Comments: 29 pages, 14 figures

🔗 Code/Project: https://github.com/ieellee/MicroWorld

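💡 One-Sentence Summary

MicroWorld: strengthening MLLM reasoning in the microscopy domain with a multimodal attributed property graph

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multimodal large language models, knowledge graph, scientific reasoning, microscopy images, knowledge injection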

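📋 Key Points

Existing MLLMs perform poorly in the microscopy domain because specialized data and expert knowledge are scarce, which makes precise reasoning difficult.
MicroWorld builds a multimodal attributed property graph and injects structured knowledge at inference time, improving MLLM performance without fine-tuning.
Experiments show that MicroWorld substantially improves MLLM performance on the MicroVQA and MicroBench benchmarks and outperforms existing methods.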

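📝 Abstract (Translated from Chinese)

Multimodal large language models (MLLMs) show considerable potential for scientific reasoning, but their performance in specialized domains such as microscopy is limited by the scarcity of domain-specific training data and by the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge this gap, we introduce MicroWorld, a framework that builds a multimodal attributed property graph (MAPG) from large-scale scientific image-caption corpora and uses it at inference time to augment MLLM reasoning, without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and builds a knowledge graph of approximately 111K nodes and 346K typed edges covering eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% and setting a new state of the art. It also yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the improved generalization that MicroWorld provides. Qualitative case studies further reveal the mechanisms by which structured knowledge improves reasoning, as well as failure modes that point to promising future directions.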

🔬 Method Details

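Problem definition: Existing MLLMs face two obstacles in scientific reasoning over microscopy images: domain data is scarce, and expert knowledge is hard to embed effectively. As a result, models struggle to interpret image details, relate biomedical entities, and carry out complex reasoning. Existing approaches typically require fine-tuning on large amounts of domain data, which is costly and generalizes poorly.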

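Core idea: MicroWorld builds a multimodal attributed property graph (MAPG) that represents images, entities, and relations in structured form, and uses it as an external knowledge base to augment MLLM reasoning. Retrieving relevant knowledge at inference time and injecting it into the MLLM prompt improves performance without domain-specific fine-tuning.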
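As a rough illustration of inference-time injection, the sketch below prepends retrieved facts to a question before it is sent to a model. The function name `build_augmented_prompt`, the `max_facts` cap, and the prompt wording are all hypothetical; the source does not give the actual prompt template.

```python
def build_augmented_prompt(question, facts, max_facts=5):
    """Prepend retrieved facts to the question; fall back to the bare question.

    The layout ("Background knowledge:" header, bullet list, then the question)
    is an assumed template for illustration, not the one used by MicroWorld.
    """
    if not facts:
        return question  # nothing retrieved: leave the prompt unchanged
    lines = ["Background knowledge:"]
    lines += [f"- {fact}" for fact in facts[:max_facts]]
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


prompt = build_augmented_prompt(
    "Which organelle is shown in the image?",
    ["The relation between mitochondria and cristae is has_part."],
)
print(prompt)
```

Because the knowledge enters only through the prompt, the model weights stay untouched, which is the sense in which no fine-tuning is needed.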

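Technical framework: MicroWorld has three stages: 1) MAPG construction: extract biomedical entities and relations from large-scale scientific image-caption corpora, and use Qwen3-VL-Embedding to align images and entities in a shared embedding space; 2) Knowledge retrieval: given the entities in the input query, retrieve the relevant knowledge subgraph from the MAPG; 3) Knowledge injection: insert the retrieved subgraph into the MLLM prompt in structured form (for example, as a set of triples) to guide the model toward more accurate reasoning.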
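The graph and the retrieval stage can be pictured with a small in-memory sketch. Everything here is an assumption made for illustration: the class names `Edge` and `AttributeGraph`, the relation labels `has_part` and `depicts`, and the use of exact string matching with one-hop expansion. The source states only that the graph has typed edges in eight relation categories and that query entities are matched to it.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    head: str
    relation: str  # typed relation label (the labels used below are made up)
    tail: str


@dataclass
class AttributeGraph:
    # node id -> attributes, e.g. {"kind": "entity"} or {"kind": "image"}
    nodes: dict = field(default_factory=dict)
    # node id -> list of edges touching that node
    adjacency: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_edge(self, head, relation, tail):
        edge = Edge(head, relation, tail)
        self.adjacency[head].append(edge)
        self.adjacency[tail].append(edge)

    def one_hop(self, query_entities):
        """Return the deduplicated edges touching any matched query entity."""
        seen, result = set(), []
        for name in query_entities:
            if name not in self.nodes:
                continue  # unmatched entities contribute nothing
            for edge in self.adjacency[name]:
                if edge not in seen:
                    seen.add(edge)
                    result.append(edge)
        return result


graph = AttributeGraph()
graph.add_node("mitochondria", kind="entity")
graph.add_node("cristae", kind="entity")
graph.add_node("img_0001", kind="image")
graph.add_edge("mitochondria", "has_part", "cristae")
graph.add_edge("img_0001", "depicts", "mitochondria")

subgraph = graph.one_hop(["mitochondria", "unknown_term"])
for edge in subgraph:
    print(edge.head, edge.relation, edge.tail)
```

Image nodes and entity nodes share one graph, so a matched entity can surface both related entities and the images linked to it.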

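Key innovation: MicroWorld uses an external knowledge graph as a reasoning enhancer for the MLLM. Knowledge retrieval and injection make effective use of domain knowledge and avoid costly fine-tuning of the MLLM. In addition, the multimodal attributed property graph encodes image, entity, and relation information together, giving a more complete knowledge representation.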

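Key design: During MAPG construction, entities and relations are extracted with scispaCy or LLM-based triplet mining. During knowledge retrieval, cosine similarity measures how closely a query entity matches nodes in the MAPG. During knowledge injection, the retrieved triples are added to the MLLM prompt in natural language, for example "The relation between entity A and entity B is relation C."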
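The matching and verbalization steps can be sketched as follows. The toy 3-dimensional vectors stand in for real embeddings, and the `k` and `threshold` parameters are assumptions; the source specifies cosine similarity and the sentence template but not the cutoff or ranking details.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_nodes(query_vec, node_vecs, k=2, threshold=0.5):
    """Rank graph nodes by cosine similarity to the query and keep the best k.

    Both k and threshold are illustrative defaults, not values from the paper.
    """
    scored = [(cosine(query_vec, vec), name) for name, vec in node_vecs.items()]
    kept = [(score, name) for score, name in scored if score >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in kept[:k]]


def verbalize(head, relation, tail):
    """Render one triple with the sentence template quoted in the text."""
    return f"The relation between {head} and {tail} is {relation}."


# Toy vectors standing in for real embeddings (values are made up).
node_vecs = {
    "mitochondria": [0.9, 0.1, 0.0],
    "nucleus": [0.1, 0.9, 0.0],
    "actin filament": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

matches = top_k_nodes(query_vec, node_vecs, k=1)
sentence = verbalize("mitochondria", "has_part", "cristae")
print(matches, sentence)
```

Cosine similarity compares direction rather than magnitude, so it is a natural fit when image and text embeddings live in one shared space but may differ in norm.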

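📊 Experimental Highlights

On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% and setting a new state of the art. It also yields a 6.0% performance gain on the MicroBench benchmark. These results indicate that MicroWorld substantially strengthens MLLM reasoning in the microscopy domain.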

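🎯 Application Scenarios

MicroWorld can be applied to biomedical image analysis, drug discovery, disease diagnosis, and related fields. The framework helps researchers interpret microscopy images, extract useful information from them, and reason scientifically about what they show. By injecting domain knowledge into MLLMs, MicroWorld has the potential to accelerate research and technical innovation in these areas.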

📄 Abstract (Original)

Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image-caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.
