| 1 |
TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models |
TAMP: token-adaptive layerwise pruning for multimodal large language models |
large language model multimodal TAMP |
|
|
| 2 |
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts |
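Proposes the COUNTS dataset and the O(OD)2 and OODG benchmarks to evaluate how well object detectors and multimodal large models generalize under distribution shifts. |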
large language model multimodal visual grounding |
|
|
| 3 |
Mavors: Multi-granularity Video Representation for Multimodal Large Language Model |
Mavors: multi-granularity video representation for multimodal large language models, improving long-video understanding |
large language model multimodal |
|
|
| 4 |
Multimodal Long Video Modeling Based on Temporal Dynamic Context |
Proposes the TDC model, based on temporal dynamic context, to address information loss in multimodal long-video understanding. |
large language model multimodal chain-of-thought |
✅ |
|
| 5 |
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models |
InternVL3: exploring advanced training and test-time methods for open-source multimodal models |
large language model multimodal |
|
|
| 6 |
Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data |
Proposes the IMAX dataset to improve the multi-task learning capability of medical generalist foundation models |
large language model foundation model |
|
|
| 7 |
Integrating Vision and Location with Transformers: A Multimodal Deep Learning Framework for Medical Wound Analysis |
Proposes a Transformer-based multimodal deep learning framework for classifying and localizing medical wound images. |
multimodal |
|
|
| 8 |
Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure |
Uses vision-language models to summarize multimodal presentations and studies the effect of modalities and structure |
multimodal |
|
|
| 9 |
Relation-Rich Visual Document Generator for Visual Information Extraction |
Proposes RIDGE, which uses content-driven layout generation to address information extraction from relation-rich visual documents. |
large language model multimodal |
✅ |
|
| 10 |
MIEB: Massive Image Embedding Benchmark |
MIEB: a massive image embedding benchmark for comprehensively evaluating image and image-text embedding models. |
large language model multimodal |
✅ |
|
| 11 |
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer |
Proposes SAIL, a unified multimodal large language model built on a single Transformer, improving the scalability of vision-language learning |
large language model multimodal |
✅ |
|
| 12 |
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography |
CameraBench: evaluating visual reasoning in multimodal large language models through photography |
large language model multimodal |
|
|
| 13 |
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding |
Proposes the Socratic Chart framework, which improves the robustness of MLLMs in SVG chart understanding through multi-agent collaboration. |
large language model multimodal |
|
|
| 14 |
SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model |
Proposes SlowFastVAD, which combines a fast detector with a RAG-enhanced vision-language model for efficient, interpretable video anomaly detection. |
multimodal |
|
|
| 15 |
XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark |
Proposes XY-Cut++, which significantly improves document layout ordering through a hierarchical mask mechanism |
large language model |
|
|
| 16 |
DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction |
Proposes DTFSal to address multimodal fusion in audio-visual saliency prediction |
multimodal |
|
|