| 1 |
Culture In a Frame: C$^3$B as a Comic-Based Benchmark for Multimodal Culturally Awareness |
Proposes C$^3$B, a comic-based benchmark for evaluating multimodal cultural awareness |
large language model multimodal |
|
|
| 2 |
Planning with Unified Multimodal Models |
Proposes Uni-Plan, which uses unified multimodal models for long-horizon planning to improve decision-making |
large language model multimodal |
|
|
| 3 |
DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice |
DentVLM: a multimodal vision-language model for comprehensive dental diagnosis and enhanced clinical practice |
multimodal |
|
|
| 4 |
Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning |
Proposes an LLM-LMM framework that decouples reasoning from perception to make visual reasoning more reliable |
large language model multimodal chain-of-thought |
|
|
| 5 |
Learning Regional Monsoon Patterns with a Multimodal Attention U-Net |
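Proposes a framework for learning regional monsoon patterns with a multimodal attention U-Net, improving rainfall prediction accuracy for India |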
multimodal |
|
|
| 6 |
TATTOO: Training-free AesTheTic-aware Outfit recOmmendation |
Proposes TATTOO, a training-free, aesthetic-aware outfit recommendation method |
large language model multimodal chain-of-thought |
|
|
| 7 |
GRAPE: Let GPRO Supervise Query Rewriting by Ranking for Retrieval |
GRAPE: supervises query rewriting through ranking, improving retrieval performance under distribution shift |
large language model multimodal |
✅ |
|
| 8 |
Seeing Symbols, Missing Cultures: Probing Vision-Language Models' Reasoning on Fire Imagery and Cultural Meaning |
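Proposes a diagnostic framework built on fire-themed cultural imagery, revealing biases in how vision-language models understand culture |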
multimodal |
|
|
| 9 |
SynDoc: A Hybrid Discriminative-Generative Framework for Enhancing Synthetic Domain-Adaptive Document Key Information Extraction |
SynDoc: a hybrid discriminative-generative framework for enhancing synthetic domain-adaptive document key information extraction |
multimodal |
|
|
| 10 |
Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection |
Proposes a self-consistency method based on self-reflection to reduce hallucinations in vision-language models |
instruction following |
|
|
| 11 |
Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models |
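Proposes CADC, a capability-attribution data curation framework that makes instruction tuning of vision-language models more efficient |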
multimodal |
|
|