Visual Grounding of Whole Radiology Reports for 3D CT Images

Authors: Akimichi Ichinose, Taro Hatsutani, Keigo Nakamura, Yoshiro Kitamura, Satoshi Iizuka, Edgar Simo-Serra, Shoji Kido, Noriyuki Tomiyama

Category: cs.CV

Published: 2023-12-08

Comments: 14 pages, 7 figures. Accepted at MICCAI 2023

Journal: Medical Image Computing and Computer Assisted Intervention, Lecture Notes in Computer Science 14224 (2023) 611-621

DOI: 10.1007/978-3-031-43904-9_59

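💡 One-Sentence Summary

Proposes a visual grounding framework for radiology reports on 3D CT images that improves the accuracy of localizing anomalies.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: visual grounding, medical imaging, CT images, radiology reports, anatomical segmentation, report structuring, computer-aided diagnosis

📋 Key Points

Existing medical image recognition systems depend on large-scale annotated data, and manual annotation is expensive; visual grounding can automatically associate image regions with report descriptions and so reduce annotation cost.
The paper proposes a visual grounding framework that combines anatomical segmentation with report structuring, using organ masks and structured report information to improve grounding accuracy.
On a large-scale dataset of 10,410 studies, the framework clearly outperforms the baseline model, raising grounding accuracy from 66.0% to 77.8%.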

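📝 Abstract (translated from the Chinese version)

Building large-scale training datasets is a key problem in developing medical image recognition systems. Visual grounding techniques automatically associate objects in images with their corresponding descriptions and can therefore ease the labeling of large numbers of images. Visual grounding of radiology reports for CT images nevertheless remains challenging, because CT imaging can reveal many kinds of anomalies and the resulting reports are long and complex. This paper proposes a visual grounding framework designed for CT image and report pairs, covering various body parts and diverse anomaly types. The framework combines two components: 1) anatomical segmentation of images and 2) report structuring. Anatomical segmentation provides multiple organ masks for a given CT image and helps the grounding model recognize detailed anatomy. Report structuring helps to accurately extract the presence, location, and type of each anomaly described in the corresponding report. With these two additional image and report features, the grounding model achieves better localization. For verification, the authors constructed a large-scale dataset with region-description correspondence annotations for 10,410 studies of 7,321 unique patients. They evaluated the framework using grounding accuracy (the percentage of correctly localized anomalies) and showed that combining anatomical segmentation with report structuring substantially improves performance (66.0% vs 77.8%). Comparison with prior techniques also showed higher performance for the proposed method.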

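🔬 Method Details

Problem definition: The paper addresses visual grounding of radiology reports for 3D CT images. Existing methods struggle with the wide variety of anomalies visible in CT images and with the long, complex descriptions in reports, which limits grounding accuracy. The high cost of manual annotation in turn limits the development of medical image recognition systems.

Core idea: Combine anatomical segmentation with report structuring so that the visual grounding model receives richer image and text information. Anatomical segmentation supplies organ masks that help the model understand the anatomy in the image; report structuring extracts the presence, location, and type of each anomaly, helping the model focus on the key content of the report.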
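The three fields that report structuring extracts (presence, location, type) can be pictured as one record per anomaly. The following is a minimal sketch only: the class name `AnomalyFinding`, its field names, the example sentences, and the choice to ground only findings asserted as present are all illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AnomalyFinding:
    """One anomaly mentioned in a report (hypothetical schema)."""
    present: bool       # True if the report asserts the anomaly, False if it negates it
    location: str       # anatomical location phrase
    anomaly_type: str   # anomaly category
    source_text: str    # report sentence the record was derived from


def findings_to_ground(findings: List[AnomalyFinding]) -> List[AnomalyFinding]:
    """Keep only findings asserted as present (assumption: negated findings have no region)."""
    return [f for f in findings if f.present]


# Made-up example report with one positive and one negated finding.
report = [
    AnomalyFinding(True, "right lower lobe", "nodule",
                   "A 6 mm nodule is seen in the right lower lobe."),
    AnomalyFinding(False, "liver", "mass",
                   "No mass is seen in the liver."),
]
targets = findings_to_ground(report)
```

Keeping the source sentence alongside the structured fields makes it possible to trace each grounded region back to the report text it came from.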

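Technical framework: The framework has two main modules: 1) an anatomical segmentation module that generates organ masks for the CT image, and 2) a report structuring module that extracts anomaly information from the report. The outputs of both modules are fed to the visual grounding model as additional features to improve its localization ability. The overall flow is: take a CT image and its radiology report as input, run anatomical segmentation and report structuring to obtain organ masks and structured report information, feed these to the visual grounding model, and output the location of each anomaly.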
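The image side of this flow can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions: `segment_anatomy` is a placeholder that merely bins intensities (a real system would use a trained segmentation network), and stacking the masks as extra input channels is one plausible way to pass them to a grounding model, not necessarily what the paper does.

```python
import numpy as np


def segment_anatomy(ct: np.ndarray, num_organs: int = 3) -> np.ndarray:
    """Placeholder segmenter returning one binary mask per 'organ', shape (K, D, H, W)."""
    # Stand-in logic: split the intensity range into equal bands, one band per mask.
    edges = np.linspace(ct.min(), ct.max(), num_organs + 1)
    masks = [(ct >= lo) & (ct < hi) for lo, hi in zip(edges[:-1], edges[1:])]
    return np.stack(masks).astype(np.float32)


def build_image_features(ct: np.ndarray, organ_masks: np.ndarray) -> np.ndarray:
    """Stack the CT volume and organ masks along a channel axis: (1 + K, D, H, W)."""
    return np.concatenate([ct[None].astype(np.float32), organ_masks], axis=0)


rng = np.random.default_rng(0)
ct = rng.normal(size=(4, 8, 8))              # toy volume: depth 4, height 8, width 8
masks = segment_anatomy(ct)                  # (3, 4, 8, 8)
features = build_image_features(ct, masks)   # (4, 4, 8, 8)
```

The resulting `features` array would then be paired with the structured report records and passed to the grounding model, whose architecture this summary does not specify.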

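Key innovation: The key innovation is combining anatomical segmentation with report structuring to give the visual grounding model more complete information. Earlier visual grounding methods typically use only the raw image and text and overlook the specific characteristics of medical images. By adding anatomical segmentation and report structuring, the paper makes better use of prior knowledge about medical images and thereby improves grounding accuracy.

Key design: This summary does not specify how the anatomical segmentation module is implemented. The report structuring module presumably uses natural language processing techniques to extract key information from reports. The architecture of the visual grounding model is also unspecified, although it plausibly uses some form of attention mechanism to relate image and text information. The loss function aims to minimize the difference between the predicted and ground-truth anomaly locations.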

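📊 Experimental Highlights

On a large-scale dataset of 10,410 studies, the framework clearly outperforms the baseline model, raising grounding accuracy from 66.0% to 77.8%, an absolute gain of 11.8 percentage points. Comparison with prior techniques also shows higher performance for the proposed method, supporting the value of anatomical segmentation and report structuring for visual grounding of CT radiology reports.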
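Grounding accuracy is defined as the percentage of correctly localized anomalies. The sketch below shows how such a metric could be computed; the criterion for "correctly localized" (a Dice overlap of at least `threshold` between predicted and reference masks) and the default threshold value are assumptions made for illustration, since this summary does not state the criterion used in the paper.

```python
import numpy as np


def dice(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice overlap between two binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(2.0 * np.logical_and(pred, ref).sum() / denom)


def grounding_accuracy(preds, refs, threshold: float = 0.2) -> float:
    """Percentage of anomalies whose predicted mask overlaps the reference enough.

    The Dice-threshold criterion is an assumption, not the paper's stated rule.
    """
    hits = sum(dice(p, r) >= threshold for p, r in zip(preds, refs))
    return 100.0 * hits / len(refs)


# Toy example: one exact hit and one complete miss.
ref = np.zeros((4, 8, 8), dtype=bool)
ref[1:3, 2:5, 2:5] = True
miss = np.zeros_like(ref)
miss[0, 6:8, 6:8] = True

accuracy = grounding_accuracy([ref.copy(), miss], [ref, ref])

# Absolute gain reported in the paper, in percentage points.
gain = round(77.8 - 66.0, 1)
```

With one of two toy anomalies localized, `accuracy` is 50.0; `gain` reproduces the 11.8-point difference between the reported baseline and full-framework results.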

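🎯 Application Scenarios

The work can support computer-aided diagnosis in medical imaging by helping physicians locate anomalies in CT images quickly and accurately, improving diagnostic efficiency and accuracy. It can also be used to build large-scale annotated medical image datasets, advancing medical image recognition systems for tasks such as automated disease screening and assessment of disease status.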

📄 Abstract (Original)

Building a large-scale training dataset is an essential problem in the development of medical image recognition systems. Visual grounding techniques, which automatically associate objects in images with corresponding descriptions, can facilitate labeling of large number of images. However, visual grounding of radiology reports for CT images remains challenging, because so many kinds of anomalies are detectable via CT imaging, and resulting report descriptions are long and complex. In this paper, we present the first visual grounding framework designed for CT image and report pairs covering various body parts and diverse anomaly types. Our framework combines two components of 1) anatomical segmentation of images, and 2) report structuring. The anatomical segmentation provides multiple organ masks of given CT images, and helps the grounding model recognize detailed anatomies. The report structuring helps to accurately extract information regarding the presence, location, and type of each anomaly described in corresponding reports. Given the two additional image/report features, the grounding model can achieve better localization. In the verification process, we constructed a large-scale dataset with region-description correspondence annotations for 10,410 studies of 7,321 unique patients. We evaluated our framework using grounding accuracy, the percentage of correctly localized anomalies, as a metric and demonstrated that the combination of the anatomical segmentation and the report structuring improves the performance with a large margin over the baseline model (66.0% vs 77.8%). Comparison with the prior techniques also showed higher performance of our method.
