A Graph-Augmented Knowledge Distillation-Based Dual-Stream Vision Transformer with Region-Aware Attention for Gastrointestinal Disease Classification with Explainable AI
Authors: Md Assaduzzaman, Nushrat Jahan Oyshi, Eram Mahamud
Categories: eess.IV, cs.CV
Published: 2025-12-24
💡 One-Sentence Takeaway
Proposes a graph-augmented, knowledge-distillation-based dual-stream Vision Transformer for explainable gastrointestinal disease classification.
🎯 Matched Area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)
Keywords: gastrointestinal disease classification, knowledge distillation, dual-stream network, Vision Transformer, Swin Transformer, explainable AI, medical image analysis, Tiny-ViT
📋 Key Points
- Classifying gastrointestinal diseases from endoscopic and histopathological images is challenging due to large data volumes and subtle inter-class differences.
- Proposes a knowledge-distillation-based dual-stream Vision Transformer framework in which a teacher model guides the learning of a student model.
- Experiments show that the framework achieves excellent classification accuracy and AUC on two datasets while remaining interpretable.
🔬 Method Details
Problem definition: The paper targets the classification of gastrointestinal diseases from endoscopic images. Existing methods struggle to capture global context and local fine-grained features simultaneously, and their high computational cost makes them hard to deploy in resource-constrained clinical settings.
Core idea: The paper uses knowledge distillation: a high-capacity teacher model extracts both global and local features, and this knowledge is transferred to a lightweight student model, reducing computational complexity while preserving classification accuracy.
Technical framework: The framework consists of a teacher model and a student model. The teacher is a dual-stream network composed of a Swin Transformer and a Vision Transformer, which extract global contextual information and local fine-grained features, respectively. The student is a compact Tiny-ViT. Through knowledge distillation, the student learns from the teacher's soft labels and thereby inherits the teacher's knowledge.
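To make the dual-stream teacher concrete, here is a minimal PyTorch sketch built on timm backbones. The specific Swin/ViT variants, the projection dimension, and the concatenation-based fusion are illustrative assumptions, not details reported in the paper:

```python
import torch
import torch.nn as nn
import timm


class DualStreamTeacher(nn.Module):
    """Hypothetical dual-stream teacher: a Swin branch for global context and a
    ViT branch for local fine-grained features, fused by concatenation."""

    def __init__(self, num_classes: int, embed_dim: int = 512):
        super().__init__()
        # Backbone variants are assumptions; num_classes=0 makes timm return pooled features.
        self.swin = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False, num_classes=0)
        self.vit = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=0)
        self.proj_swin = nn.Linear(self.swin.num_features, embed_dim)
        self.proj_vit = nn.Linear(self.vit.num_features, embed_dim)
        self.head = nn.Sequential(
            nn.LayerNorm(2 * embed_dim),
            nn.Linear(2 * embed_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        glob_feat = self.proj_swin(self.swin(x))   # global contextual stream
        local_feat = self.proj_vit(self.vit(x))    # local fine-grained stream
        return self.head(torch.cat([glob_feat, local_feat], dim=-1))


teacher = DualStreamTeacher(num_classes=8)        # e.g. 8 GI disease classes (assumed)
logits = teacher(torch.randn(2, 3, 224, 224))     # -> shape (2, 8)
```

Under these assumptions, the Swin stream supplies globally pooled context and the ViT stream contributes patch-level detail; the concatenated embedding feeds a single classification head whose logits later serve as the teacher's soft labels for distillation.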
Key innovation: The main novelty lies in the dual-stream teacher design, which extracts global contextual information and local fine-grained features simultaneously to improve classification accuracy. In addition, knowledge distillation transfers the teacher's knowledge to the student, preserving accuracy while reducing computational complexity.
Key design: The teacher combines a Swin Transformer (global context) with a Vision Transformer (local fine-grained features). The student adopts a Tiny-ViT structure to lower computational cost. Distillation uses soft labels, with a loss function combining cross-entropy and KL-divergence terms.
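A minimal sketch of the soft-label distillation objective described above, combining hard-label cross-entropy with a temperature-scaled KL term; the temperature and weighting factor alpha are assumed hyperparameters rather than values reported in the paper:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Soft-label KD loss: cross-entropy on ground-truth labels plus
    KL divergence to the teacher's temperature-softened distribution."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # T^2 rescaling keeps gradient magnitudes comparable
    return alpha * ce + (1.0 - alpha) * kl


# Toy usage: distill a Tiny-ViT-style student from the teacher's logits.
student_logits = torch.randn(2, 8, requires_grad=True)
teacher_logits = torch.randn(2, 8)
labels = torch.tensor([0, 3])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In this formulation the teacher's softened logits act as the "soft labels" the paper refers to, and the alpha weight trades off fidelity to the teacher against fitting the ground-truth annotations.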
📊 Experimental Highlights
The framework performs strongly on two wireless capsule endoscopy datasets, reaching accuracies of 0.9978 and 0.9928 on Dataset 1 and Dataset 2, respectively, with an average AUC of 1.0000. The Tiny-ViT student maintains high classification accuracy while reducing computational complexity and speeding up inference.
🎯 Application Scenarios
The framework can support AI-assisted diagnosis of gastrointestinal diseases, helping clinicians screen and diagnose more accurately and efficiently. Its interpretability and low computational cost make it suitable for resource-constrained clinical environments, with the potential to broaden the adoption of intelligent endoscopic screening and improve diagnostic quality.
📄 Abstract (Original)
The accurate classification of gastrointestinal diseases from endoscopic and histopathological imagery remains a significant challenge in medical diagnostics, mainly due to the vast data volume and subtle variation in inter-class visuals. This study presents a hybrid dual-stream deep learning framework built on teacher-student knowledge distillation, where a high-capacity teacher model integrates the global contextual reasoning of a Swin Transformer with the local fine-grained feature extraction of a Vision Transformer. The student network was implemented as a compact Tiny-ViT structure that inherits the teacher's semantic and morphological knowledge via soft-label distillation, achieving a balance between efficiency and diagnostic accuracy. Two carefully curated Wireless Capsule Endoscopy datasets, encompassing major GI disease classes, were employed to ensure balanced representation and prevent inter-sample bias. The proposed framework achieved remarkable performance with accuracies of 0.9978 and 0.9928 on Dataset 1 and Dataset 2 respectively, and an average AUC of 1.0000, signifying near-perfect discriminative capability. Interpretability analyses using Grad-CAM, LIME, and Score-CAM confirmed that the model's predictions were grounded in clinically significant tissue regions and pathologically relevant morphological cues, validating the framework's transparency and reliability. The Tiny-ViT demonstrated diagnostic performance with reduced computational complexity comparable to its transformer-based teacher while delivering faster inference, making it suitable for resource-constrained clinical environments. Overall, the proposed framework provides a robust, interpretable, and scalable solution for AI-assisted GI disease diagnosis, paving the way toward future intelligent endoscopic screening that is compatible with clinical practicality.