| 1 | HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale | HuatuoGPT-Vision: improves the medical capabilities of multimodal LLMs by injecting large-scale medical visual knowledge | large language model multimodal | | |
| 2 | DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | DocKylin: a large multimodal document understanding model with efficient visual slimming | large language model multimodal | | |
| 3 | ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos | ReXTime: a benchmark suite for reasoning across time in videos | large language model multimodal | | |
| 4 | RAVEN: Multitask Retrieval Augmented Vision-Language Learning | RAVEN: a multitask retrieval-augmented vision-language learning framework that improves VLM performance | large language model multimodal | | |
| 5 | OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | OMG-LLaVA: a multimodal model that unifies image-level, object-level, and pixel-level reasoning and understanding | multimodal | | |
| 6 | CELLO: Causal Evaluation of Large Vision-Language Models | Proposes CELLO to address causal reasoning in large vision-language models | chain-of-thought | ✅ | |
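| 7 | Revisiting Backdoor Attacks against Large Vision-Language Models from Domain Shift | Proposes MABA, a domain-generalizable multimodal backdoor attack method targeting large vision-language models, which improves attack success rate | multimodal | | |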
| 8 | ViT LoS V2X: Vision Transformers for Environment-aware LoS Blockage Prediction for 6G Vehicular Networks | Proposes a Vision Transformer based, environment-aware LoS blockage prediction method for V2X in 6G vehicular networks | multimodal | | |