MarkushGrapher-2: End-to-end Multimodal Recognition of Chemical Structures

Authors: Tim Strohmeyer, Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Ahmed Nassar, Peter Staar

Categories: cs.CV

Published: 2026-03-30

Comments: 15 pages, to be published in CVPR 2026

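💡 One-sentence summary

Proposes MarkushGrapher-2, an end-to-end method for multimodal recognition of chemical structures that substantially improves recognition accuracy.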

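🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: multimodal learning, chemical structure recognition, Markush structures, OCR, Vision-Text-Layout encoding, autoregressive generation, dataset construction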

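📋 Key points

Existing methods for recognizing multimodal chemical structures (Markush structures) are not precise enough for large-scale automatic processing.
MarkushGrapher-2 jointly encodes text, image, and layout information and fuses the encodings through a two-stage training strategy.
Experiments show that the method substantially outperforms prior work on multimodal Markush structure recognition while retaining strong performance on molecular structure recognition.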

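📝 Abstract (translated)

This paper presents MarkushGrapher-2, an end-to-end method for recognizing multimodal chemical structures (Markush structures) in documents. The method first uses a dedicated OCR model to extract text from chemical images. It then jointly encodes text, image, and layout information through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the encodings are fused through a two-stage training strategy and used to autoregressively generate a representation of the Markush structure. To address the lack of training data, the authors introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. They also present IP5-M, a large manually annotated benchmark of real-world Markush structures, intended to advance research on this challenging task. Extensive experiments show that the method substantially outperforms state-of-the-art models on multimodal Markush structure recognition while maintaining strong performance on molecular structure recognition. Code, models, and datasets are publicly released.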

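🔬 Method details

Problem definition: The paper addresses the automatic extraction of multimodal chemical structures (Markush structures) from documents. Existing methods are not precise enough on such structures to support large-scale automatic processing. The two pain points are how to fuse image, text, and layout information effectively, and the lack of large-scale real-world data for training.

Core idea: Design an end-to-end model that processes image, text, and layout information together and is trained on a large-scale dataset. Jointly encoding these modalities lets the model recognize complex Markush structures more accurately, and a two-stage training strategy is used to fuse the encodings.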

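Technical framework: MarkushGrapher-2 consists of the following main components: 1) a dedicated OCR model that extracts text from chemical images; 2) a Vision-Text-Layout encoder that jointly encodes text, image, and layout information; 3) an Optical Chemical Structure Recognition (OCSR) vision encoder that extracts chemical structure information from the image; 4) a two-stage training strategy that fuses the encodings from the different modalities; 5) an autoregressive generator that produces the representation of the Markush structure.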
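The data flow through these components can be sketched as below. This is a minimal illustration only, not the authors' implementation: every name (`OcrToken`, `recognize_markush`, the `toy_*` stubs) is hypothetical, fusion is shown as plain sequence concatenation purely as an assumption, and decoding is shown as a greedy loop because the summary does not specify the decoding strategy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
Vector = List[float]


@dataclass
class OcrToken:
    """One piece of text found in the image, with its location (hypothetical type)."""
    text: str
    box: Box


def recognize_markush(
    image: object,
    ocr: Callable[[object], List[OcrToken]],
    vtl_encoder: Callable[[object, List[OcrToken]], List[Vector]],
    ocsr_encoder: Callable[[object], List[Vector]],
    decoder_step: Callable[[List[Vector], List[str]], str],
    max_len: int = 64,
    eos: str = "<eos>",
) -> List[str]:
    # 1) Dedicated OCR: text plus bounding boxes from the chemical image.
    tokens = ocr(image)
    # 2) Vision-Text-Layout encoder: sees image, text, and layout together.
    vtl_states = vtl_encoder(image, tokens)
    # 3) OCSR vision encoder: sees the image only.
    ocsr_states = ocsr_encoder(image)
    # 4) Fusion. ASSUMPTION: simple concatenation along the sequence axis.
    memory = vtl_states + ocsr_states
    # 5) Autoregressive generation: each step conditions on the fused memory
    #    and on everything generated so far. ASSUMPTION: greedy decoding.
    output: List[str] = []
    for _ in range(max_len):
        next_token = decoder_step(memory, output)
        if next_token == eos:
            break
        output.append(next_token)
    return output


# Toy stand-ins so the sketch runs; real models would replace these.
def toy_ocr(image: object) -> List[OcrToken]:
    return [OcrToken("R1", (12, 8, 30, 20))]


def toy_vtl(image: object, tokens: List[OcrToken]) -> List[Vector]:
    return [[float(len(t.text))] for t in tokens]


def toy_ocsr(image: object) -> List[Vector]:
    return [[0.0], [1.0]]


def toy_decoder_step(memory: List[Vector], prefix: List[str]) -> str:
    script = ["tok_a", "tok_b", "<eos>"]  # placeholder output tokens
    return script[len(prefix)]


if __name__ == "__main__":
    print(recognize_markush(None, toy_ocr, toy_vtl, toy_ocsr, toy_decoder_step))
```

Passing the components in as callables keeps the sketch independent of any particular model library; only the order of the steps is taken from the description above.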

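Key innovations: 1) an end-to-end model for multimodal Markush structure recognition; 2) the combination of a Vision-Text-Layout encoder with an OCSR vision encoder to jointly encode text, image, and layout information; 3) a two-stage training strategy for fusing the resulting encodings; 4) an automatic pipeline for constructing a large-scale dataset of real-world Markush structures, together with IP5-M, a large manually annotated benchmark of real-world Markush structures. Compared with existing methods, the approach recognizes complex Markush structures more accurately.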

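Key design choices: The abstract does not give implementation specifics. The design choices that matter are: 1) the selection and training of the dedicated OCR model, so that text in the image is extracted accurately; 2) the network architecture and parameter settings of the Vision-Text-Layout encoder, which determine how well the modalities are fused; 3) the concrete form of the two-stage training strategy, such as the loss function and optimizer; 4) the decoding strategy of the autoregressive generator, which determines how accurately the Markush representation is produced.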
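To make "layout information" concrete: layout-aware document models such as LayoutLM commonly rescale each OCR bounding box to a fixed 0 to 1000 integer grid, so that box coordinates do not depend on image resolution. The sketch below shows that generic preprocessing step. It is an assumption for illustration; the summary does not state that MarkushGrapher-2 prepares its layout input this way, and `normalize_box` and `layout_inputs` are hypothetical helper names.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels
GridBox = Tuple[int, int, int, int]      # same corners on a 0..grid scale


def normalize_box(box: Box, width: float, height: float, grid: int = 1000) -> GridBox:
    """Rescale a pixel box to an integer grid, clamping to [0, grid]."""
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")

    def scale(value: float, size: float) -> int:
        return min(grid, max(0, int(round(grid * value / size))))

    x0, y0, x1, y1 = box
    return (scale(x0, width), scale(y0, height), scale(x1, width), scale(y1, height))


def layout_inputs(
    ocr_tokens: List[Tuple[str, Box]], width: float, height: float
) -> List[Tuple[str, GridBox]]:
    """Pair each OCR string with its resolution-independent box."""
    return [(text, normalize_box(box, width, height)) for text, box in ocr_tokens]


if __name__ == "__main__":
    # Hypothetical OCR output for a 500 x 400 pixel image.
    tokens = [("R1", (50, 100, 150, 200)), ("= alkyl", (160, 100, 300, 200))]
    for text, box in layout_inputs(tokens, width=500, height=400):
        print(text, box)
```

Clamping guards against OCR boxes that spill slightly past the image border, which would otherwise produce coordinates outside the grid.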

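📊 Experimental highlights

According to this summary, MarkushGrapher-2 substantially outperforms prior methods on multimodal Markush structure recognition and achieves state-of-the-art results on the IP5-M benchmark. The abstract does not report specific figures for the improvement. The model also retains strong performance on molecular structure recognition, which supports its generality.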

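🎯 Applications

The work applies to large-scale automatic analysis of the chemical literature, for example patent analysis, drug discovery, and chemical information retrieval. Extracting chemical structures automatically can speed up this research and support knowledge discovery in chemistry. In the future, the technique may extend to harder recognition tasks, such as hand-drawn chemical structures.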

📄 Abstract (original)

Automatically extracting chemical structures from documents is essential for the large-scale analysis of the literature in chemistry. Automatic pipelines have been developed to recognize molecules represented either in figures or in text independently. However, methods for recognizing chemical structures from multimodal descriptions (Markush structures) lag behind in precision and cannot be used for automatic large-scale processing. In this work, we present MarkushGrapher-2, an end-to-end approach for the multimodal recognition of chemical structures in documents. First, our method employs a dedicated OCR model to extract text from chemical images. Second, the text, image, and layout information are jointly encoded through a Vision-Text-Layout encoder and an Optical Chemical Structure Recognition vision encoder. Finally, the resulting encodings are effectively fused through a two-stage training strategy and used to auto-regressively generate a representation of the Markush structure. To address the lack of training data, we introduce an automatic pipeline for constructing a large-scale dataset of real-world Markush structures. In addition, we present IP5-M, a large manually-annotated benchmark of real-world Markush structures, designed to advance research on this challenging task. Extensive experiments show that our approach substantially outperforms state-of-the-art models in multimodal Markush structure recognition, while maintaining strong performance in molecule structure recognition. Code, models, and datasets are released publicly.
