| 1 |
Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges |
Surveys multimodal geospatial foundation models, addressing heterogeneity and distribution shift in remote sensing image analysis. |
foundation model multimodal |
|
|
| 2 |
A Survey on Efficient Vision-Language-Action Models |
Surveys efficient vision-language-action models, aiming to bridge the gap between digital knowledge and physical-world interaction. |
vision-language-action VLA |
✅ |
|
| 3 |
Towards Generalisable Foundation Models for Brain MRI |
BrainFound: a general-purpose, generalisable foundation model for brain MRI |
foundation model multimodal |
|
|
| 4 |
PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models |
PISA-Bench: a multilingual, multimodal benchmark for evaluating vision-language models |
large language model multimodal |
|
|
| 5 |
PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection |
PRISM-Bench: a puzzle-based visual task benchmark with CoT error detection |
large language model multimodal chain-of-thought |
|
|
| 6 |
Multitask Multimodal Self-Supervised Learning for Medical Images |
Proposes Medformer for multitask multimodal self-supervised learning on medical images, reducing reliance on labeled data. |
multimodal |
|
|
| 7 |
MMSD3.0: A Multi-Image Benchmark for Real-World Multimodal Sarcasm Detection |
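Proposes the MMSD3.0 multi-image sarcasm detection benchmark and the CIRM model to address sarcasm recognition from multi-image cues in real-world scenarios |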
multimodal |
|
|
| 8 |
AG-Fusion: adaptive gated multimodal fusion for 3d object detection in complex scenes |
Proposes AG-Fusion, an adaptive gated multimodal fusion method, to address the robustness of 3D object detection in complex scenes |
multimodal |
|
|
| 9 |
Implicit Modeling for Transferability Estimation of Vision Foundation Models |
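Proposes an implicit transferability modeling (ITM) framework to improve the accuracy and efficiency of transferability estimation for vision foundation models. |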
foundation model |
|
|
| 10 |
Revisiting Multimodal Positional Encoding in Vision-Language Models |
Proposes multi-head rotary position encoding (MHRoPE) and its variants to improve the multimodal understanding of vision-language models |
multimodal |
✅ |
|
| 11 |
LightFusion: A Light-weighted, Double Fusion Framework for Unified Multimodal Understanding and Generation |
LightFusion: a lightweight double-fusion framework for unified multimodal understanding and generation |
multimodal |
|
|
| 12 |
DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning |
DynaStride: uses MMCoT and dynamic stride windowing for multi-scene captioning of instructional videos |
multimodal chain-of-thought |
|
|
| 13 |
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity |
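Proposes PixelRefer, a unified region-level multimodal large language model framework for spatio-temporal object referring at arbitrary granularity. |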
large language model multimodal |
|
|
| 14 |
On the Faithfulness of Visual Thinking: Measurement and Enhancement |
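Proposes the SCCM learning strategy to improve the reliability and sufficiency of visual information in the multimodal reasoning of vision-language models. |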
multimodal chain-of-thought |
✅ |
|
| 15 |
CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting |
CountFormer: Transformer-based class-agnostic object counting that learns visual repetition and structure |
foundation model |
|
|
| 16 |
The Underappreciated Power of Vision Models for Graph Structural Understanding |
Explores the potential of vision models for graph structural understanding and proposes the GraphAbstract benchmark. |
foundation model |
|
|