Improving Representation of High-frequency Components for Medical Visual Foundation Models

Authors: Yuetan Chu, Yilan Zhang, Zhongyi Han, Changchun Yang, Longxi Zhou, Gongning Luo, Chao Huang, Xin Gao

Categories: eess.IV, cs.AI, cs.CV

Published: 2024-07-19 (updated: 2025-03-03)

Journal: IEEE Transactions on Medical Imaging (2025)

DOI: 10.1109/TMI.2025.3559402

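💡 One-Sentence Summary

Proposes Frepa, a pretraining strategy that strengthens how medical visual foundation models represent high-frequency information and improves downstream task performance.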

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: medical vision, foundation models, self-supervised learning, high-frequency information, image reconstruction

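📋 Key Points

Existing medical visual foundation models represent high-frequency information and fine detail poorly, which limits their performance on medical image analysis tasks.
Frepa combines high-frequency masking, low-frequency perturbation, and adversarial learning to strengthen the encoder's ability to represent and preserve high-frequency components.
Experiments show that Frepa outperforms other self-supervised pretraining methods on multiple medical imaging tasks, with the largest gains on tasks that depend on fine detail.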

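📝 Abstract (Translated)

Medical visual foundation models generalize well across downstream tasks but are limited in representing high-frequency components and fine-grained details. Because intricate anatomical structures, sub-visual features, and complex boundaries in medical images make such information essential, the paper proposes a new pretraining strategy, the Frequency-advanced Representation Autoencoder (Frepa). Through high-frequency masking, low-frequency perturbation, and adversarial learning, Frepa pushes the encoder to represent and preserve high-frequency components in the image embeddings. The paper also proposes a histogram-equalized image masking strategy that extends the masked autoencoder approach to architectures such as Swin Transformer and convolutional networks. Frepa was developed on nine medical modalities and validated on 32 downstream tasks covering 2D images and 3D volumes. Without fine-tuning, Frepa outperforms other self-supervised pretraining methods and in some cases even surpasses models trained for the specific task. The gains are largest on tasks that depend on fine detail, for example up to +15% DSC on retinal vessel segmentation and up to +7% IoU on lung nodule detection. Quantitative experiments further show that Frepa achieves superior high-frequency representation and preservation in the embeddings, underscoring its potential for building more general medical image foundation models.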

🔬 Method Details

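Problem definition: Medical images carry a large amount of high-frequency information, such as fine vessels and tissue texture. Existing visual foundation models represent this information poorly, which limits their performance on medical image analysis tasks. Because they struggle to distinguish and preserve these critical high-frequency details, diagnostic accuracy suffers.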

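Core idea: Frepa uses a targeted pretraining strategy that forces the model to learn and represent the high-frequency components of an image. With high-frequency masking and low-frequency perturbation, the model must reconstruct the original image from a corrupted one, so it attends more closely to high-frequency detail. Adversarial learning further increases the model's sensitivity to high-frequency information.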

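Technical framework: Frepa is an autoencoder with an encoder and a decoder. During pretraining, the input image first goes through histogram-equalized masking and is then fed to the encoder for feature extraction. The decoder reconstructs the image from the encoder's features. High-frequency masking, low-frequency perturbation, and adversarial learning drive the optimization of the encoder and decoder parameters.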

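Key innovation: Frepa's main contribution is a pretraining strategy aimed specifically at high-frequency information. Conventional masked autoencoders typically use random masking, whereas Frepa uses high-frequency masking to force the model to attend to high-frequency detail. Frepa also adds low-frequency perturbation and adversarial learning to further strengthen high-frequency representation. The histogram-equalized masking strategy lets Frepa be applied to different network architectures.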

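Key design: Frepa's key design elements are:
1) High-frequency masking: a mask is generated with a high-pass filter so that the model is trained to preserve the image's high-frequency components.
2) Low-frequency perturbation: the image's low-frequency components are randomly perturbed, which makes reconstruction harder.
3) Adversarial learning: a discriminator distinguishes reconstructed images from real ones, pushing the model to generate more realistic high-frequency detail.
4) Histogram-equalized masking: the mask is generated from the image's histogram distribution so that it is spread more evenly and information is not concentrated in particular regions.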
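To make the first two design elements concrete, here is a minimal NumPy sketch of one way an image could be corrupted in the frequency domain before reconstruction. It is an illustration under stated assumptions, not the paper's implementation: the hard radial split, the `cutoff` and `noise_scale` values, the multiplicative Gaussian jitter, and the function names `radial_frequency`, `corrupt_frequencies`, and `high_frequency_energy` are all hypothetical choices made for this example.

```python
import numpy as np


def radial_frequency(shape):
    """Normalized radial frequency of every 2D FFT coefficient (0 = DC)."""
    fy = np.fft.fftfreq(shape[0])[:, None]  # vertical frequencies, cycles/pixel
    fx = np.fft.fftfreq(shape[1])[None, :]  # horizontal frequencies, cycles/pixel
    return np.sqrt(fy ** 2 + fx ** 2)


def corrupt_frequencies(image, cutoff=0.15, noise_scale=0.1, seed=0):
    """Zero the high-frequency band and perturb the low-frequency band.

    `cutoff`, `noise_scale`, and the hard radial split are illustrative
    assumptions, not values or choices taken from the paper.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fft2(image)
    low_band = radial_frequency(image.shape) <= cutoff

    # High-frequency masking: drop every coefficient above the cutoff.
    corrupted = np.where(low_band, spectrum, 0.0)

    # Low-frequency perturbation: random multiplicative jitter on what is left.
    jitter = 1.0 + noise_scale * rng.standard_normal(image.shape)
    corrupted = np.where(low_band, corrupted * jitter, corrupted)

    # The jitter breaks Hermitian symmetry, so keep only the real part.
    return np.real(np.fft.ifft2(corrupted))


def high_frequency_energy(image, cutoff=0.15):
    """Fraction of spectral energy above the cutoff (DC excluded)."""
    power = np.abs(np.fft.fft2(image)) ** 2
    radius = radial_frequency(image.shape)
    return power[radius > cutoff].sum() / power[radius > 0].sum()
```

In a reconstruction objective of this kind, the decoder would be asked to recover `image` from `corrupt_frequencies(image)`, so the encoder cannot succeed without encoding the missing high-frequency content. The adversarial term and the histogram-equalized spatial masking are left out of this sketch.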

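📊 Experimental Highlights

Frepa was validated on 32 downstream tasks. Without fine-tuning, it outperforms other self-supervised pretraining methods and in some cases even surpasses models trained for the specific task. DSC improves by up to +15% on retinal vessel segmentation, and IoU improves by up to +7% on lung nodule detection. The results indicate that Frepa improves the model's representation of high-frequency information and, in turn, downstream task performance.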
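For readers unfamiliar with the two reported metrics, the sketch below computes them from their standard definitions, DSC = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|. It assumes flattened binary masks; the summary does not say how the paper evaluates IoU for detection (masks or boxes), and the function name `dice_and_iou` and the empty-mask convention are choices made for this example.

```python
def dice_and_iou(pred, target):
    """Return (DSC, IoU) for two equal-length binary masks given as 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection

    if union == 0:
        # Both masks are empty; treating this as perfect agreement is an
        # assumption of this sketch, not a convention stated in the paper.
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Two 6-pixel masks that agree on 2 foreground pixels out of 3 each.
dice, iou = dice_and_iou([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
```

On this toy pair the overlap is 2 pixels, each mask has 3 foreground pixels, and the union has 4, so DSC is 4/6 and IoU is 2/4. DSC is never lower than IoU on the same pair, which is worth keeping in mind when comparing the two reported gains.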

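🎯 Application Scenarios

Frepa can be applied to a range of medical image analysis tasks, including disease diagnosis, lesion detection, and image segmentation. Better representation of high-frequency information can improve diagnostic accuracy and help clinicians make more precise judgments. The work contributes to building more general and more capable medical visual foundation models for medical image analysis.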

📄 Abstract (Original)

Foundation models have recently attracted significant attention for their impressive generalizability across diverse downstream tasks. However, these models are demonstrated to exhibit great limitations in representing high-frequency components and fine-grained details. In many medical imaging tasks, the precise representation of such information is crucial due to the inherently intricate anatomical structures, sub-visual features, and complex boundaries involved. Consequently, the limited representation of prevalent foundation models can result in significant performance degradation or even failure in these tasks. To address these challenges, we propose a novel pretraining strategy, named Frequency-advanced Representation Autoencoder (Frepa). Through high-frequency masking and low-frequency perturbation combined with adversarial learning, Frepa encourages the encoder to effectively represent and preserve high-frequency components in the image embeddings. Additionally, we introduce an innovative histogram-equalized image masking strategy, extending the Masked Autoencoder approach beyond ViT to other architectures such as Swin Transformer and convolutional networks. We develop Frepa across nine medical modalities and validate it on 32 downstream tasks for both 2D images and 3D volume data. Without fine-tuning, Frepa can outperform other self-supervised pretraining methods and, in some cases, even surpasses task-specific trained models. This improvement is particularly significant for tasks involving fine-grained details, such as achieving up to a +15% increase in DSC for retina vessel segmentation and a +7% increase in IoU for lung nodule detection. Further experiments quantitatively reveal that Frepa enables superior high-frequency representations and preservation in the embeddings, underscoring its potential for developing more generalized and universal medical image foundation models.
