| 1 | ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models | ImageChain: enhances sequential image-to-text reasoning in multimodal large language models through multi-turn dialogue | large language model multimodal | | |
| 2 | Tell me why: Visual foundation models as self-explainable classifiers | Proposes ProtoFM: a self-explainable classifier that combines visual foundation models with a prototype architecture | foundation model | ✅ | |
| 3 | A Survey on Foundation-Model-Based Industrial Defect Detection | Survey: industrial defect detection methods based on pre-trained models (foundation models) | foundation model | | |
| 4 | CLIP-Optimized Multimodal Image Enhancement via ISP-CNN Fusion for Coal Mine IoVT under Uneven Illumination | Proposes a multimodal image enhancement method based on ISP-CNN fusion and CLIP optimization for low-light coal mine IoVT scenes | multimodal | | |
| 5 | Improved YOLOv12 with LLM-Generated Synthetic Data for Enhanced Apple Detection and Benchmarking Against YOLOv11 and YOLOv10 | Improves YOLOv12 with LLM-generated synthetic data, raising apple detection performance above YOLOv11 and YOLOv10 | large language model | | |
| 6 | FungalZSL: Zero-Shot Fungal Classification with Image Captioning Using a Synthetic Data Approach | FungalZSL: zero-shot fungal classification using synthetic data and image captioning | large language model | | |
| 7 | Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM | Proposes the Sherlock model for extracting and localizing abnormal events in multi-scene videos | large language model | | |