YOLOv10 with Kolmogorov-Arnold networks and vision-language foundation models for interpretable object detection and trustworthy multimodal AI in computer vision perception

Authors: Marios Impraimakis, Daniel Vazquez, Feiyu Zhou

Categories: cs.CV, cs.AI, cs.CL, cs.LG, cs.RO

Published: 2026-03-24

Comments: 14 pages, 23 figures, 6 tables

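💡 One-Sentence Summary

Proposes a YOLOv10 pipeline combined with a Kolmogorov-Arnold network and a vision-language model for interpretable object detection and trustworthy multimodal AI.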

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: object detection, explainable AI, Kolmogorov-Arnold network, confidence estimation, multimodal learning, YOLOv10, computer vision, trustworthy AI

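📋 Key Points

Existing object detectors give confidence scores that become unreliable under poor visual conditions or in ambiguous scenes, and they offer little transparency about when this happens.
A Kolmogorov-Arnold network models the trustworthiness of YOLOv10 detections, and visualising each feature's influence shows how reliable a given prediction is.
Experiments show that the framework identifies low-trust predictions, and a BLIP model adds a lightweight multimodal interface without affecting the interpretability layer.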

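📝 Abstract (translated from Chinese)

This paper examines the interpretable object detection capabilities of a novel Kolmogorov-Arnold network framework. The method targets a key limitation in computer vision applications such as autonomous vehicle perception: in visually degraded or ambiguous scenes, these systems offer little transparency about how reliable their confidence scores are. To address this, a Kolmogorov-Arnold network is used as an interpretable post-hoc surrogate that models the trustworthiness of YOLOv10 detections from seven geometric and semantic features. The additive spline-based structure of the Kolmogorov-Arnold network allows each feature's influence to be visualised directly, producing smooth and transparent functional mappings that reveal when the model's confidence is well supported and when it is unreliable. Experiments on the COCO dataset and on images from the University of Bath campus show that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture, which provides actionable insight for filtering, review, or downstream risk mitigation. In addition, a Bootstrapped Language-Image (BLIP) foundation model generates a descriptive caption for each scene. This gives a lightweight multimodal interface without affecting the interpretability layer. The resulting system delivers interpretable object detection with trustworthy confidence estimates, offering a transparent and practical perception component for autonomous and multimodal AI applications.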

🔬 Method Details

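Problem definition: The paper addresses the unreliability of confidence estimates from object detectors in complex visual scenes. With existing detectors such as the YOLO family, the confidence score often fails to reflect how reliable a detection really is when image quality degrades (blur, occlusion) or texture is scarce, which exposes downstream tasks to risk.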

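Core idea: A Kolmogorov-Arnold network (KAN) serves as a post-hoc surrogate model for YOLOv10. It learns the relationship between geometric and semantic features of each detection and its trustworthiness, and uses that relationship to assess how reliable a YOLOv10 prediction is. Because the KAN is additive, the influence of each feature on the trust estimate can be visualised separately, which is what makes the surrogate interpretable.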
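For reference, the additive structure described above can be written out. The first line is the standard Kolmogorov-Arnold representation of a continuous function of $n$ variables. The second line is a purely additive special case with a sigmoid link, which is an assumption made here for illustration and not necessarily the paper's exact parameterisation. The symbols $x_p$ (the seven input features), $\phi_p$ (learned univariate spline functions), $b$ (a bias), and $\hat{t}$ (the trust score) are notation introduced for this sketch.

```latex
\begin{align}
f(x_1,\dots,x_n) &= \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right) \\
\hat{t}(x_1,\dots,x_7) &= \sigma\!\left(b + \sum_{p=1}^{7} \phi_p(x_p)\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\end{align}
```

In the additive form, plotting each $\phi_p$ against $x_p$ shows how that single feature moves the trust score, independently of the others.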

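Technical framework: The pipeline has three main modules: 1) a YOLOv10 object detector that produces the initial detections and confidence scores; 2) a feature extraction module that computes geometric features (such as area and aspect ratio) and semantic features (such as class probability) for each detection box; 3) a Kolmogorov-Arnold network that takes these features as input and predicts a reliability score for each detection. A BLIP model generates scene descriptions and supplies the multimodal information.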
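The feature extraction step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary names only area, aspect ratio, and class probability as examples, so the `Detection` container, the `box_features` helper, and the particular features chosen here (including the box centre) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical container for one detector output (pixel coordinates)."""
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge
    confidence: float  # detector confidence in [0, 1]
    class_id: int


def box_features(det: Detection, img_w: int, img_h: int) -> dict:
    """Return a few illustrative geometric and semantic features for one box."""
    w = max(det.x2 - det.x1, 0.0)
    h = max(det.y2 - det.y1, 0.0)
    return {
        # Geometric features, normalised by image size where that makes sense.
        "rel_area": (w * h) / (img_w * img_h),
        "aspect_ratio": w / h if h > 0 else 0.0,
        "center_x": (det.x1 + det.x2) / 2.0 / img_w,
        "center_y": (det.y1 + det.y2) / 2.0 / img_h,
        # Semantic feature: the detector's own confidence for the predicted class.
        "confidence": det.confidence,
    }


det = Detection(x1=100, y1=50, x2=300, y2=150, confidence=0.82, class_id=0)
feats = box_features(det, img_w=640, img_h=480)
```

Normalising by image size keeps the features comparable across images of different resolution, which matters if the surrogate is trained on one dataset and applied to another.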

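Key innovation: The main contribution is the use of a Kolmogorov-Arnold network as an interpretable confidence evaluator. Unlike a conventional black-box model, the KAN's additive structure lets the effect of each input feature on the output trust score be inspected directly, which improves interpretability and transparency. Combining it with a BLIP model adds a lightweight multimodal interface that strengthens scene understanding.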
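A small sketch can show why an additive structure makes per-feature inspection possible. It assumes a purely additive model with a sigmoid output and uses piecewise-linear shape functions in place of the smoother B-splines a real KAN would learn. The knot positions, values, and feature names in `SHAPES` are invented for illustration and are not taken from the paper.

```python
from bisect import bisect_right
import math


def piecewise_linear(knots, values, x):
    """Linearly interpolate `values` over sorted `knots`; clamp outside the range."""
    if x <= knots[0]:
        return values[0]
    if x >= knots[-1]:
        return values[-1]
    i = bisect_right(knots, x) - 1
    t = (x - knots[i]) / (knots[i + 1] - knots[i])
    return values[i] + t * (values[i + 1] - values[i])


# Hypothetical learned shape functions: feature name -> (knots, values).
SHAPES = {
    "confidence": ([0.0, 0.5, 1.0], [-2.0, 0.0, 2.0]),
    "rel_area": ([0.0, 0.05, 1.0], [-1.5, 0.5, 0.5]),
}
BIAS = 0.0


def trust_score(features):
    """Return (trust in [0, 1], per-feature additive contributions to the logit)."""
    contributions = {
        name: piecewise_linear(knots, values, features[name])
        for name, (knots, values) in SHAPES.items()
    }
    logit = BIAS + sum(contributions.values())
    return 1.0 / (1.0 + math.exp(-logit)), contributions


score, parts = trust_score({"confidence": 0.75, "rel_area": 0.01})
```

Here `parts` holds one number per feature, so a reviewer can see that the high detector confidence pushes the trust score up while the very small box pulls it down. Plotting each shape function over its input range gives the kind of per-feature curve the summary describes.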

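Key design: The KAN takes seven geometric and semantic features as input; the exact selection may need adjusting for a given application. The loss function is designed to make the KAN's predictions match the ground-truth reliability labels as closely as possible. The BLIP model is used pretrained, with no additional training, which keeps the computational cost low.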
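The summary does not say how the reliability labels or the loss are defined. One common choice, assumed here purely for illustration, is to label a detection as reliable when its box overlaps a ground-truth box with IoU of at least 0.5 and to train with binary cross-entropy. The helper names and the threshold are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def reliability_label(pred_box, gt_box, threshold=0.5):
    """Assumed labelling rule: 1.0 if the prediction matches the ground truth."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else 0.0


def bce(pred, label, eps=1e-7):
    """Binary cross-entropy for one sample, with clipping for numerical safety."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


label = reliability_label((0, 0, 10, 10), (5, 0, 15, 10))
loss = bce(0.5, label)
```

Under this rule, a surrogate that outputs a high trust score for a poorly localised box is penalised, which pushes the learned shape functions toward flagging such detections.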

📊 Experimental Highlights

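The experiments show that the framework identifies low-trust predictions under blur, occlusion, or low texture. Results on the COCO dataset and on images from the University of Bath campus support the method's effectiveness. Visualising each feature's influence exposes the basis of the trust estimate and gives a starting point for improving the model. Adding the BLIP model strengthens scene understanding without affecting the interpretability layer.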
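The abstract mentions filtering and review as downstream uses of the trust score. A minimal sketch of such a triage step is shown below. The `triage` function, the `trust` key, and both thresholds are hypothetical and would need tuning on validation data.

```python
def triage(detections, accept_at=0.7, reject_below=0.3):
    """Split detections into accept / review / reject buckets by trust score."""
    buckets = {"accept": [], "review": [], "reject": []}
    for det in detections:
        trust = det["trust"]
        if trust >= accept_at:
            buckets["accept"].append(det)
        elif trust < reject_below:
            buckets["reject"].append(det)
        else:
            # Borderline cases go to a human or a slower fallback model.
            buckets["review"].append(det)
    return buckets


detections = [
    {"label": "car", "trust": 0.91},
    {"label": "person", "trust": 0.48},
    {"label": "bicycle", "trust": 0.12},
]
result = triage(detections)
```

Keeping a separate review bucket, instead of a single cut-off, lets a safety-critical system treat uncertain detections conservatively without discarding them.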

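🎯 Applications

The results apply to autonomous driving, robot navigation, intelligent surveillance, and similar domains. Trustworthy confidence estimates can improve the safety and reliability of such systems in complex environments. The interpretability analysis also helps uncover latent weaknesses in the detector and guides model improvement, and the multimodal interface can improve human-machine interaction.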

📄 Abstract (Original)

The interpretable object detection capabilities of a novel Kolmogorov-Arnold network framework are examined here. The approach addresses a key limitation in computer vision for autonomous vehicle perception and beyond. These systems offer limited transparency regarding the reliability of their confidence scores in visually degraded or ambiguous scenes. To address this limitation, a Kolmogorov-Arnold network is employed as an interpretable post-hoc surrogate to model the trustworthiness of the You Only Look Once (YOLOv10) detections using seven geometric and semantic features. The additive spline-based structure of the Kolmogorov-Arnold network enables direct visualisation of each feature's influence. This produces smooth and transparent functional mappings that reveal when the model's confidence is well supported and when it is unreliable. Experiments on both Common Objects in Context (COCO) and images from the University of Bath campus demonstrate that the framework accurately identifies low-trust predictions under blur, occlusion, or low texture. This provides actionable insights for filtering, review, or downstream risk mitigation. Furthermore, a bootstrapped language-image (BLIP) foundation model generates descriptive captions of each scene. This tool enables a lightweight multimodal interface without affecting the interpretability layer. The resulting system delivers interpretable object detection with trustworthy confidence estimates. It offers a powerful, transparent, and practical perception component for autonomous and multimodal artificial intelligence applications.
