| # | Title | Summary | Tags | Notes |
|---|-------|---------|------|-------|
| 1 | FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis | The first comprehensive benchmark of multimodal large language models for fingerprint analysis. | large language model, foundation model, multimodal | |
| 2 | GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation | Exposes the visual grounding gap in multimodal large language models through multi-dimensional evaluation. | large language model, multimodal, visual grounding | |
| 3 | Adversarial Robustness of Vision in Open Foundation Models | Shows that the visual modality is an effective attack surface for open vision-language models (VLMs), and that model robustness does not directly correlate with benchmark performance. | foundation model | |
| 4 | PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology | Fine-grained language-image pretraining for versatile computational pathology. | large language model, multimodal, instruction following | |
| 5 | Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection | FALCON-SFOD: leverages foundation model priors to enhance object focus in source-free object detection. | foundation model | |
| 6 | MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation | Introduces the MULTIAQUA multimodal maritime dataset and explores robust training strategies for multimodal semantic segmentation. | multimodal | |
| 7 | HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection | Identifies robust anomaly-sensitive heads in MLLMs to enable tuning-free video anomaly detection. | large language model, multimodal | |
| 8 | Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding | Proposes the Robust-R1 framework, which achieves robust visual understanding by explicitly modeling visual degradation. | large language model, multimodal | |
| 9 | A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs | Introduces RSHR-Bench, a benchmark for ultra-high-resolution remote sensing multimodal large language models. | large language model, multimodal | ✅ |
| 10 | Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images | DRIM: improves the multi-turn self-reflection of vision-language models when reasoning with images. | multimodal, chain-of-thought | |
| 11 | Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training | Proposes training-free Keypoint Counting Classifiers that turn ViTs into self-explainable models. | foundation model | |
| 12 | Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model | Proposes Auxiliary Descriptive Knowledge (ADK) to improve few-shot adaptation of vision-language models. | large language model | |
| 13 | ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching | Proposes ABE-CLIP, a training-free enhancement of CLIP's attribute-binding ability for compositional image-text matching. | multimodal | |