| 1 |
KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old? |
KidVis: Evaluating whether multimodal large language models possess the visual perceptual capabilities of a 6-year-old child |
large language model multimodal |
|
|
| 2 |
GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards |
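GI-Bench: A benchmark revealing the dissociation between knowledge and experience in multimodal large language models applied clinically to gastrointestinal endoscopy |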
large language model multimodal |
✅ |
|
| 3 |
M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding |
Proposes M3CoTBench for evaluating the chain-of-thought reasoning ability of multimodal large language models in medical image understanding. |
large language model multimodal chain-of-thought |
✅ |
|
| 4 |
Reasoning Matters for 3D Visual Grounding |
Proposes Reason3DVG-8B, which improves reasoning for 3D visual grounding through synthetic data and LLM fine-tuning. |
large language model visual grounding |
|
|
| 5 |
Edge-Optimized Multimodal Learning for UAV Video Understanding via BLIP-2 |
Proposes an edge-optimized multimodal learning framework based on BLIP-2 to improve UAV video understanding. |
multimodal |
|
|
| 6 |
UM-Text: A Unified Multimodal Model for Image Understanding |
UM-Text: Proposes a unified multimodal model that addresses visual text editing and style consistency in image understanding. |
multimodal |
|
|
| 7 |
HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding |
HIPPO: Accelerates video large language model inference via holistic-aware parallel speculative decoding |
large language model |
|
|
| 8 |
Improving Zero-shot ADL Recognition with Large Language Models through Event-based Context and Confidence |
Proposes a zero-shot ADL recognition method for large language models based on event context and confidence |
large language model |
|
|
| 9 |
Semantic Misalignment in Vision-Language Models under Perceptual Degradation |
Studies semantic misalignment in vision-language models under perceptual degradation |
embodied AI multimodal |
|
|
| 10 |
Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention |
Understands and refines visual fusion in MLLMs via a contrastive attention mechanism |
large language model multimodal |
|
|
| 11 |
Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models |
Proposes a closed-loop LLM-based method for discovering channel priors that improves vision model performance. |
large language model |
|
|
| 12 |
Enhancing Image Quality Assessment Ability of LMMs via Retrieval-Augmented Generation |
Proposes IQARAG, which uses retrieval-augmented generation to improve large models' ability on image quality assessment tasks. |
multimodal |
|
|
| 13 |
Instruction-Driven 3D Facial Expression Generation and Transition |
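Proposes an instruction-driven framework for 3D facial expression generation and transition that achieves realistic expression simulation. |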
multimodal |
✅ |
|