| 1 |
Towards Visual Text Grounding of Multimodal Large Language Model |
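Proposes the TRIG benchmark to address the difficulty multimodal large language models have with visual text grounding in text-rich images. |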
large language model multimodal visual grounding |
|
|
| 2 |
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models |
SCAM: a real-world dataset for evaluating the robustness of multimodal foundation models to typographic attacks |
large language model foundation model multimodal |
|
|
| 3 |
OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance |
Proposes OCC-MLLM-CoT-Alpha, which improves MLLM performance on occlusion recognition through 3D-aware supervision and CoT guidance |
large language model chain-of-thought |
|
|
| 4 |
LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts |
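LEO-MINI: uses conditional token reduction and a mixture of multi-modal experts to improve the efficiency and visual reasoning ability of multimodal large language models |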
large language model multimodal |
|
|
| 5 |
Training state-of-the-art pathology foundation models with orders of magnitude less data |
Trains a competitive pathology foundation model using far less data than SOTA models |
foundation model |
|
|
| 6 |
The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation |
Uses large multimodal models to tackle motion-expression video segmentation, winning first place in the PVUW MeViS challenge. |
multimodal |
|
|
| 7 |
SSLFusion: Scale & Space Aligned Latent Fusion Model for Multimodal 3D Object Detection |
SSLFusion: proposes a scale- and space-aligned latent fusion model for multimodal 3D object detection. |
multimodal |
|
|
| 8 |
AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification |
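AsyReC: proposes a framework based on multimodal graph neural networks for spatio-temporal asymmetric dyadic relationship classification |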
multimodal |
✅ |
|
| 9 |
Lumina-OmniLV: A Unified Multimodal Framework for General Low-Level Vision |
Lumina-OmniLV: a unified multimodal framework for general low-level vision |
multimodal |
|
|
| 10 |
REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding |
Proposes REEF, a relevance-aware and efficient LLM adapter for video understanding |
large language model foundation model |
|
|
| 11 |
URECA: Unique Region Caption Anything |
Proposes the URECA dataset and model to address uniqueness and consistency in multi-granularity region captioning. |
large language model multimodal |
|
|
| 12 |
Seeking and Updating with Live Visual Knowledge |
Proposes the LiveVQA dataset for evaluating and updating multimodal large language models' understanding of live visual knowledge. |
large language model multimodal |
|
|
| 13 |
Explaining Low Perception Model Competency with High-Competency Counterfactuals |
Proposes five methods for generating high-confidence counterfactual images to explain low perception model competency |
large language model multimodal |
|
|
| 14 |
InstructionBench: An Instructional Video Understanding Benchmark |
Proposes InstructionBench for evaluating the temporal reasoning ability of video large language models in instructional video understanding. |
large language model |
✅ |
|
| 15 |
Video-Bench: Human-Aligned Video Generation Benchmark |
Proposes Video-Bench: a video generation evaluation benchmark better aligned with human perception |
large language model |
|
|