| 1 |
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models |
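Proposes a controllable image-caption generation pipeline that tailors caption formats to the preferences of different multimodal pre-training models. |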
foundation model multimodal |
|
|
| 2 |
Contrastive Localized Language-Image Pre-Training |
Proposes contrastive localized language-image pre-training to improve visual representations. |
large language model foundation model multimodal |
|
|
| 3 |
IC3M: In-Car Multimodal Multi-object Monitoring for Abnormal Status of Both Driver and Passengers |
Proposes IC3M, an in-car multimodal, multi-object system for monitoring abnormal states of the driver and passengers. |
multimodal |
|
|
| 4 |
A Foundation Model for the Solar Dynamics Observatory |
SDO-FM: a multimodal heliophysics foundation model for the Solar Dynamics Observatory. |
foundation model |
|
|
| 5 |
LLaVA-Video: Video Instruction Tuning With Synthetic Data |
LLaVA-Video: video instruction tuning with synthetic data to improve the performance of large multimodal video models. |
multimodal instruction following |
|
|
| 6 |
Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment |
Proposes Dog-IQA, a standard-guided, zero-shot, mix-grained image quality assessment method that leverages the prior knowledge of MLLMs. |
large language model multimodal |
✅ |
|
| 7 |
SCA: Improve Semantic Consistent in Unrestricted Adversarial Attacks via DDPM Inversion |
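Proposes the SCA framework, which uses DDPM inversion and MLLM guidance to improve the semantic consistency and efficiency of unrestricted adversarial attacks. |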
large language model multimodal |
✅ |
|
| 8 |
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos |
Vinoground: a benchmark dataset for evaluating LMMs on temporal reasoning over short videos. |
multimodal |
|
|
| 9 |
Loong: Generating Minute-level Long Videos with Autoregressive Language Models |
Loong: a method for generating minute-level long videos with autoregressive language models. |
large language model |
✅ |
|
| 10 |
Learning from Offline Foundation Features with Tensor Augmentations |
LOFF-TA: uses offline foundation-model features and tensor augmentations for efficient learning in resource-constrained settings. |
foundation model |
|
|
| 11 |
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM |
DTVLT: an LLM-based diverse-text benchmark for visual language tracking. |
large language model |
|
|
| 12 |
Parameter Competition Balancing for Model Merging |
Proposes PCB-Merging, which balances parameter competition to merge models efficiently and improve multi-task performance. |
large language model |
✅ |
|