Hybrid Knowledge Transfer through Attention and Logit Distillation for On-Device Vision Systems in Agricultural IoT

Authors: Stanley Mugisha, Rashid Kisitu, Florence Tushabe

Categories: cs.CV, cs.AI, cs.LG

Published: 2025-04-21

Comments: 12 pages and 4 figures

💡 One-Sentence Summary

Proposes a hybrid knowledge distillation framework for optimizing on-device vision systems in agricultural IoT.

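🎯 Matched Area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: knowledge distillation, agricultural IoT, edge computing, plant disease detection, Vision Transformer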

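📋 Key Points

Existing approaches struggle to reconcile the high accuracy of Vision Transformers with the low-power requirements of edge devices, which limits their use in agricultural IoT.
The paper proposes a hybrid knowledge distillation framework that transfers knowledge from a Swin Transformer to MobileNetV3 through attention alignment and logit distillation.
Experiments show that the method keeps accuracy high while substantially reducing computation and inference latency, making it better suited to edge deployment.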

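📝 Abstract (Translated)

Integrating deep learning applications into agricultural IoT systems faces the serious challenge of balancing the high accuracy of Vision Transformers (ViTs) against the efficiency demands of resource-constrained edge devices. Large transformer models such as Swin Transformer excel at plant disease classification by capturing global-local dependencies. However, their computational complexity (34.1 GFLOPs) limits their applicability and makes them unsuitable for real-time on-device inference. Lightweight models such as MobileNetV3 and TinyML are suitable for on-device inference but lack the spatial reasoning needed for fine-grained disease detection. To bridge this gap, we propose a hybrid knowledge distillation framework that jointly transfers logit and attention knowledge from a Swin Transformer teacher to a MobileNetV3 student. The method introduces adaptive attention alignment to resolve cross-architecture mismatches (resolution, channels) and a dual loss function that optimizes both class probabilities and spatial focus. On the PlantVillage-Tomato dataset (18,160 images), the distilled MobileNetV3 reaches 92.4% accuracy versus 95.9% for Swin-L, with 95% less computation on PC and a reduction of < 82% in inference latency on IoT devices (23 ms on a PC CPU and 86 ms/image on smartphone CPUs). Key innovations include IoT-centric validation metrics (13 MB memory, 0.22 GFLOPs) and dynamic resolution-matching attention maps. Comparative experiments show significant improvements over standalone CNNs and prior distillation methods, including a 3.5% accuracy gain over the MobileNetV3 baseline. Importantly, this work advances real-time, energy-efficient crop monitoring in precision agriculture and shows how ViT-level diagnostic accuracy can be attained on edge devices. Code and models will be made available after acceptance.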

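🔬 Method Details

Problem definition: The paper addresses how to deploy a high-accuracy vision model for plant disease detection on resource-constrained edge devices in agricultural IoT. Of the existing options, running a large Transformer directly is too computationally expensive to meet the real-time requirements of edge devices, while lightweight CNNs are not accurate enough for fine-grained disease detection.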

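Core idea: Use knowledge distillation to transfer what a large Transformer (the teacher) has learned to a lightweight CNN (the student), reducing computation and inference latency while preserving accuracy. Attention alignment teaches the student the teacher's spatial reasoning, and logit distillation teaches it the teacher's class predictions.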
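The logit-distillation half of this idea can be illustrated with the widely used temperature-softened KL formulation. This is a minimal standard-library sketch, not the paper's code: the summary does not say which exact loss form or temperature the authors use, so the function names, the `temperature=4.0` default, and the `T^2` scaling are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: the temperature and the T^2 factor follow the
    common knowledge-distillation recipe, not a setting confirmed by the source.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(max(ps, 1e-12)))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Usage: the loss is zero when the student reproduces the teacher's logits
# and grows as the two softened distributions diverge.
teacher = [2.0, 1.0, 0.1]
print(logit_distillation_loss([2.0, 1.0, 0.1], teacher))
print(logit_distillation_loss([0.1, 1.0, 2.0], teacher))
```

A higher temperature flattens both distributions, so the student also receives signal from the teacher's relative ranking of the non-target classes.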

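Technical framework: The framework consists of four main components: 1) the teacher model (Swin Transformer), which extracts image features and classifies diseases; 2) the student model (MobileNetV3), which performs inference on the edge device; 3) the attention alignment module, which aligns the teacher's attention maps with the student's feature maps to resolve the cross-architecture mismatch; and 4) the loss function, which combines a logit distillation loss and an attention alignment loss to train the student.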

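Key innovation: The main contribution is an adaptive attention alignment method that resolves the mismatch in feature resolution and channel count between the two architectures (Transformer and CNN). The paper also designs a dual loss function that optimizes class probabilities and spatial attention together, which improves the student's accuracy.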
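One way to picture how a resolution and channel mismatch can be bridged is sketched below. It is an assumption-laden illustration rather than the authors' module: the student's C x H x W feature map is collapsed to a single spatial map using the mean of squared activations (an attention-transfer-style choice swapped in here), the teacher map is resized with adaptive average pooling, and both maps are L2-normalized before a mean-squared error is taken. All function names are hypothetical.

```python
import math


def channel_attention_map(features):
    """Collapse a C x H x W feature map (nested lists) into an H x W map
    by averaging squared activations over channels."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(features[c][i][j] ** 2 for c in range(channels)) / channels
         for j in range(width)]
        for i in range(height)
    ]


def adaptive_avg_pool(att, out_h, out_w):
    """Resize an H x W map to out_h x out_w by averaging over index bins.
    Works for downsampling and (nearest-style) upsampling."""
    in_h, in_w = len(att), len(att[0])
    pooled = []
    for i in range(out_h):
        r0 = (i * in_h) // out_h
        r1 = ((i + 1) * in_h + out_h - 1) // out_h  # integer ceiling
        row = []
        for j in range(out_w):
            c0 = (j * in_w) // out_w
            c1 = ((j + 1) * in_w + out_w - 1) // out_w
            cells = [att[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def l2_normalize(att, eps=1e-12):
    """Scale a map so its flattened L2 norm is (approximately) 1."""
    norm = math.sqrt(sum(v * v for row in att for v in row))
    return [[v / (norm + eps) for v in row] for row in att]


def attention_alignment_loss(teacher_att, student_features):
    """MSE between the normalized, resolution-matched teacher map and the
    student's channel-collapsed attention map."""
    student_att = channel_attention_map(student_features)
    h, w = len(student_att), len(student_att[0])
    t = l2_normalize(adaptive_avg_pool(teacher_att, h, w))
    s = l2_normalize(student_att)
    return sum((tv - sv) ** 2 for tr, sr in zip(t, s) for tv, sv in zip(tr, sr)) / (h * w)


# Usage: a 4 x 4 teacher map focused on the top-left corner, compared with
# two 1-channel 2 x 2 student feature maps (same focus vs. opposite corner).
teacher_att = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(attention_alignment_loss(teacher_att, [[[3.0, 0.0], [0.0, 0.0]]]))
print(attention_alignment_loss(teacher_att, [[[0.0, 0.0], [0.0, 3.0]]]))
```

Because the target size is read from the student's feature map at call time, the same function handles any pair of resolutions, which is the sense in which the matching is "dynamic" in this sketch.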

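Key design: The attention alignment module uses a dynamic resolution-matching strategy that adapts the resolution of the attention map to the sizes of the teacher's and student's feature maps. The loss is a weighted sum of the logit distillation loss and the attention alignment loss, with the weights tuned experimentally. MobileNetV3 keeps its standard inverted-residual architecture and is fine-tuned for the task.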
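Written out, the weighted sum described above takes the following general shape. The symbols $\alpha$ and $\beta$ are placeholders introduced here for the two weights; the summary does not give their values, and it does not say whether a hard-label cross-entropy term is added alongside the two distillation terms.

```latex
\mathcal{L}_{\text{total}}
  = \alpha \, \mathcal{L}_{\text{logit}}
  + \beta \, \mathcal{L}_{\text{att}},
\qquad \alpha, \beta \ge 0
```

Here $\mathcal{L}_{\text{logit}}$ compares teacher and student class probabilities, and $\mathcal{L}_{\text{att}}$ compares the teacher's attention map with the student's spatial map after resolution matching.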

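📊 Experimental Highlights

On the PlantVillage-Tomato dataset, the distilled MobileNetV3 reaches 92.4% accuracy, slightly below Swin-L's 95.9%, while using 95% less computation and cutting inference latency by roughly 82% (reported as "< 82%" in the abstract). Accuracy improves by 3.5% over the MobileNetV3 baseline. The authors present this as achieving ViT-level diagnostic accuracy on edge devices.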
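The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted in the abstract and makes two assumptions: that the 3.5% gain is in absolute percentage points, and that the 34.1 and 0.22 GFLOPs figures describe the teacher and the student respectively. The GFLOPs ratio it computes is not necessarily the same quantity as the reported 95% reduction on PC.

```python
# Figures quoted in the abstract.
teacher_gflops = 34.1   # Swin Transformer
student_gflops = 0.22   # distilled MobileNetV3 (assumed pairing)
teacher_acc = 95.9      # Swin-L accuracy, in percent
student_acc = 92.4      # distilled MobileNetV3 accuracy, in percent
gain_over_baseline = 3.5  # assumed to be absolute percentage points

# Relative FLOPs reduction implied by the two GFLOPs figures.
flops_reduction_pct = (1 - student_gflops / teacher_gflops) * 100

# Baseline accuracy implied by the reported gain, and remaining gap to the teacher.
implied_baseline_acc = student_acc - gain_over_baseline
gap_to_teacher = teacher_acc - student_acc

print(f"FLOPs reduction from quoted GFLOPs: {flops_reduction_pct:.1f}%")
print(f"Implied MobileNetV3 baseline accuracy: {implied_baseline_acc:.1f}%")
print(f"Accuracy gap to Swin-L: {gap_to_teacher:.1f} points")
```

Under these assumptions, distillation recovers about half of the accuracy gap between the undistilled student and the teacher.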

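🎯 Application Scenarios

The results apply to crop health monitoring in precision agriculture. IoT devices equipped with the lightweight vision model can be deployed in the field to detect plant diseases automatically and in real time and to issue early warnings, helping farmers take timely control measures, reduce losses, and raise yields. The approach can also be extended to image recognition tasks in other resource-constrained settings, such as smart security and autonomous driving.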

📄 Abstract (Original)

Integrating deep learning applications into agricultural IoT systems faces a serious challenge of balancing the high accuracy of Vision Transformers (ViTs) with the efficiency demands of resource-constrained edge devices. Large transformer models like the Swin Transformers excel in plant disease classification by capturing global-local dependencies. However, their computational complexity (34.1 GFLOPs) limits applications and renders them impractical for real-time on-device inference. Lightweight models such as MobileNetV3 and TinyML would be suitable for on-device inference but lack the required spatial reasoning for fine-grained disease detection. To bridge this gap, we propose a hybrid knowledge distillation framework that synergistically transfers logit and attention knowledge from a Swin Transformer teacher to a MobileNetV3 student model. Our method includes the introduction of adaptive attention alignment to resolve cross-architecture mismatch (resolution, channels) and a dual-loss function optimizing both class probabilities and spatial focus. On the PlantVillage-Tomato dataset (18,160 images), the distilled MobileNetV3 attains 92.4% accuracy relative to 95.9% for Swin-L but at a 95% reduction on PC and < 82% in inference latency on IoT devices (23 ms on PC CPU and 86 ms/image on smartphone CPUs). Key innovations include IoT-centric validation metrics (13 MB memory, 0.22 GFLOPs) and dynamic resolution-matching attention maps. Comparative experiments show significant improvements over standalone CNNs and prior distillation methods, with a 3.5% accuracy gain over MobileNetV3 baselines. Significantly, this work advances real-time, energy-efficient crop monitoring in precision agriculture and demonstrates how we can attain ViT-level diagnostic precision on edge devices. Code and models will be made available for replication after acceptance.
