| # | Title | Summary | Keywords | |
|---|-------|---------|----------|---|
| 1 | GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation | A multi-dimensional evaluation that reveals the gap in MLLMs' visual grounding ability | large language model, multimodal, visual grounding | |
| 2 | Adversarial Robustness of Vision in Open Foundation Models | Reveals the vulnerability of open vision foundation models under adversarial attacks, and finds that robustness does not directly correlate with benchmark performance | foundation model | |
| 3 | PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology | PathFLIP: fine-grained language-image pretraining for versatile computational pathology | large language model, multimodal, instruction following | |
| 4 | Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection | FALCON-SFOD: leveraging prior knowledge to enhance object focus in source-free object detection | foundation model | |
| 5 | MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation | Introduces the MULTIAQUA multimodal maritime dataset and designs robust training strategies to improve water-surface semantic segmentation | multimodal | |
| 6 | HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection | HeadHunt-VAD: hunts anomaly-sensitive heads in MLLMs to enable tuning-free video anomaly detection | large language model, multimodal | |
| 7 | Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding | Proposes the Robust-R1 framework, which explicitly models visual degradation to improve the robustness of multimodal large models in real-world scenarios | large language model, multimodal | |
| 8 | A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs | Introduces RSHR-Bench to address the evaluation of ultra-high-resolution visual understanding in remote sensing | large language model, multimodal | ✅ |
| 9 | Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images | Proposes the DRIM model to improve the multi-turn self-reflection ability of vision-language models when reasoning with images | multimodal, chain-of-thought | |
| 10 | Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training | Proposes training-free Keypoint Counting Classifiers that turn ViTs into self-explainable models | foundation model | |
| 11 | Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model | Proposes Auxiliary Descriptive Knowledge (ADK) to improve the few-shot transfer performance of vision-language models | large language model | |
| 12 | ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching | ABE-CLIP: a training-free attribute binding enhancement method that improves compositional image-text matching | multimodal | |