| 1 |
MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models |
Proposes the MusiXQA dataset to improve the ability of multimodal large language models to understand sheet music |
large language model multimodal |
|
|
| 2 |
MANTA: Cross-Modal Semantic Alignment and Information-Theoretic Optimization for Long-form Multimodal Understanding |
MANTA: long-form multimodal understanding through cross-modal semantic alignment and information-theoretic optimization |
large language model multimodal |
|
|
| 3 |
MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering |
Proposes MOTOR, a medical visual question answering method based on multimodal optimal transport that improves clinical relevance |
multimodal |
✅ |
|
| 4 |
Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding |
Proposes the Temporal Search framework, which improves MLLM long-video understanding by iteratively zooming in on temporal intervals |
large language model multimodal |
|
|
| 5 |
Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval |
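Proposes Mask-aware TIR, which combines text-to-image retrieval with referring expression segmentation to improve retrieval accuracy and interpretability |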
large language model multimodal |
|
|
| 6 |
ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment |
ActAlign: zero-shot fine-grained video classification through language-guided sequence alignment |
large language model |
|
|
| 7 |
Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration |
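Proposes an attribute-aware, zero-shot test-time calibration method that addresses confidence calibration in test-time fine-tuning of VLMs |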
large language model |
✅ |
|