| 1 |
A Framework for a Capability-driven Evaluation of Scenario Understanding for Multimodal Large Language Models in Autonomous Driving |
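Proposes a capability-driven framework for evaluating how well multimodal large language models understand scenarios in autonomous driving. |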
large language model multimodal |
|
|
| 2 |
Towards a Unified Copernicus Foundation Model for Earth Vision |
Proposes Copernicus-FM, a unified foundation model for Earth vision that supports multimodal remote sensing data processing. |
foundation model multimodal |
✅ |
|
| 3 |
VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity |
Proposes the VERIFY benchmark for evaluating the visual explanation ability of multimodal reasoning. |
large language model multimodal |
|
|
| 4 |
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens |
Proposes MMS-LLaMA to address computational efficiency in multimodal speech recognition. |
large language model multimodal |
|
|
| 5 |
BannerAgency: Advertising Banner Design with Multimodal LLM Agents |
Proposes BannerAgency, a fully automated advertising banner design framework built on multimodal LLM agents. |
large language model multimodal |
|
|
| 6 |
Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation |
Proposes MAFN, a multimodal-aware fusion network for referring segmentation of remote sensing images. |
multimodal |
✅ |
|
| 7 |
Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models |
Proposes PURE, which uses an autoregressive multimodal generative model for robust real-world image super-resolution. |
multimodal |
✅ |
|
| 8 |
Falcon: A Remote Sensing Vision-Language Foundation Model (Technical Report) |
Falcon: a vision-language foundation model for remote sensing that handles multiple tasks in a unified way. |
foundation model |
✅ |
|
| 9 |
Towards Privacy-preserved Pre-training of Remote Sensing Foundation Models with Federated Mutual-guidance Learning |
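Proposes the FedSense framework, which uses federated mutual-guidance learning for privacy-preserving pre-training of remote sensing foundation models. |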
foundation model |
|
|
| 10 |
OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning |
OmniDiff: a comprehensive benchmark for fine-grained image difference captioning, introduced together with the M$^3$Diff model. |
large language model multimodal |
|
|
| 11 |
PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models |
PARIC: proposes probabilistic attention regularization to improve pre-trained vision-language models on language-guided image classification. |
foundation model |
|
|
| 12 |
SpaceSeg: A High-Precision Intelligent Perception Segmentation Method for Multi-Spacecraft On-Orbit Targets |
SpaceSeg: a high-precision intelligent perception segmentation method for multi-spacecraft on-orbit targets. |
foundation model |
✅ |
|
| 13 |
Solution for 8th Competition on Affective & Behavior Analysis in-the-wild |
Proposes an AU detection method based on audio-visual multimodal fusion that improves facial action unit recognition accuracy in the wild. |
multimodal |
|
|
| 14 |
Pruning the Paradox: How CLIP's Most Informative Heads Enhance Performance While Amplifying Bias |
Proposes the Concept Consistency Score (CCS), revealing an intrinsic link between CLIP model performance and social bias. |
foundation model |
|
|
| 15 |
Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation |
Proposes a human-LMM collaborative framework that improves image annotation efficiency and reduces annotator fatigue. |
multimodal |
|
|
| 16 |
Fine-Grained Instruction-Guided Graph Reasoning for Vision-and-Language Navigation |
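Proposes the OIKG framework, which improves vision-and-language navigation performance through fine-grained instruction-guided graph reasoning. |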
VLN |
|
|