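| 1 | Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models | Proposes ADSC, which uses the LLM's attention mechanism to self-compress vision tokens, improving the efficiency of multimodal large language models. | large language model multimodal | | |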
| 2 | Multimodal Classification via Total Correlation Maximization | Proposes TCMax, which addresses modality competition in multimodal classification by maximizing total correlation. | multimodal | ✅ | |
| 3 | Reliable Thinking with Images | Proposes RTWI to address the problem of noisy image-based reasoning in multimodal large language models. | large language model multimodal chain-of-thought | | |
| 4 | WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata | WISE: a multimodal search engine for visual scenes, audio, objects, faces, speech, and metadata. | multimodal | ✅ | |
| 5 | VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph | Proposes VimRAG, which uses a multimodal memory graph to tackle long-range visual-context reasoning in RAG. | multimodal | ✅ | |
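| 6 | CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding | Proposes the CBEN dataset for improving the robustness of multimodal machine learning in remote sensing image understanding under cloud occlusion. | multimodal | ✅ | |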
| 7 | PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis | Proposes PLLM, which uses pseudo-labels to self-train CAD program generation, addressing the lack of paired data. | large language model | | |
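| 8 | Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis | Proposes an MLLM-based framework for fine-grained image editing evaluation, addressing the coarseness and lack of interpretability of traditional metrics. | large language model multimodal | | |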
| 9 | Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation | Proposes the MMRad-IVL-22K dataset for anatomy-guided interleaved vision-language reasoning in chest X-ray interpretation. | multimodal chain-of-thought | ✅ | |
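| 10 | Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions | Proposes the ASID-1M dataset and the ASID-Captioner model, improving the fine-grained understanding performance of universal video multimodal large language models. | instruction following | | |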
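| 11 | Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding | Proposes a training-free acceleration method for document parsing VLMs based on hierarchical speculative decoding, improving efficiency on long documents. | multimodal | | |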
| 12 | QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching | QuEPT: a one-shot calibration scheme for quantized elastic precision with multi-bit switching in Transformers. | large language model | ✅ | |