| 1 |
Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges |
Surveys multimodal geospatial foundation models for addressing the challenges of remote sensing image analysis. |
foundation model multimodal |
|
|
| 2 |
A Survey on Efficient Vision-Language-Action Models |
A survey of efficient vision-language-action (Efficient VLA) models, aimed at reducing compute and data requirements. |
vision-language-action foundation model |
|
|
| 3 |
Towards Generalisable Foundation Models for 3D Brain MRI |
BrainFound: a general-purpose foundation model for 3D brain MRI that improves disease detection and segmentation performance. |
foundation model multimodal |
|
|
| 4 |
PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models |
PISA-Bench: a multilingual, multimodal benchmark for evaluating vision-language models. |
large language model multimodal |
|
|
| 5 |
PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection |
PRISM-Bench: a puzzle-based benchmark for evaluating interpretable multimodal reasoning. |
large language model multimodal chain-of-thought |
|
|
| 6 |
Multitask Multimodal Self-Supervised Learning for Medical Images |
Proposes Medformer for multitask, multimodal self-supervised learning on medical images, reducing reliance on labeled data. |
multimodal |
|
|
| 7 |
MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection |
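Proposes the MMSD3.0 multi-image sarcasm detection benchmark and the CIRM model to address sarcasm recognition from multi-image cues in real-world scenarios. |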
multimodal |
|
|
| 8 |
AG-Fusion: adaptive gated multimodal fusion for 3d object detection in complex scenes |
Proposes an adaptive gated fusion method for 3D object detection in complex scenes. |
multimodal |
|
|
| 9 |
Implicit Modeling for Transferability Estimation of Vision Foundation Models |
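Proposes Implicit Transferability Modeling (ITM) to efficiently estimate how well vision foundation models transfer to downstream tasks. |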
foundation model |
|
|
| 10 |
Revisiting Multimodal Positional Encoding in Vision-Language Models |
Proposes multi-head rotary position embedding (MHRoPE) and its variant MRoPE-I to improve multimodal positional encoding in vision-language models. |
multimodal |
✅ |
|
| 11 |
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation |
LightFusion: a lightweight double-fusion framework for unified multimodal understanding and generation. |
multimodal |
|
|
| 12 |
DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning |
DynaStride: a dynamic stride windowing method combined with MMCoT for generating multi-scene captions for instructional videos. |
multimodal chain-of-thought |
|
|
| 13 |
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity |
Proposes PixelRefer, a unified region-level MLLM framework for spatio-temporal object referring at arbitrary granularity. |
large language model multimodal |
|
|
| 14 |
On the Faithfulness of Visual Thinking: Measurement and Enhancement |
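Proposes the SCCM learning strategy to improve the reliability and sufficiency of visual information in the multimodal reasoning of vision-language models. |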
multimodal chain-of-thought |
✅ |
|
| 15 |
CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting |
CountFormer: a Transformer framework that learns visual repetition and structure for class-agnostic object counting. |
foundation model |
|
|
| 16 |
A Video Is Not Worth a Thousand Words |
Proposes Shapley-value-based feature attribution and modality scoring methods to assess how heavily VLMs rely on text in VQA tasks. |
large language model |
✅ |
|
| 17 |
The Underappreciated Power of Vision Models for Graph Structural Understanding |
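Uses vision models for graph structural understanding, matching graph neural networks in performance and revealing their advantage in global perception. |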
foundation model |
|
|