| 1 |
SkinCaRe: A Multimodal Dermatology Dataset Annotated with Medical Caption and Chain-of-Thought Reasoning |
SkinCaRe: a multimodal dermatology dataset with medical captions and chain-of-thought reasoning |
large language model multimodal chain-of-thought |
✅ |
|
| 2 |
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model |
Proposes AcFormer, a low-cost, efficient connector for multimodal large language models based on visual anchors |
large language model multimodal |
✅ |
|
| 3 |
EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition |
EffoVPR: uses a foundation model effectively for visual place recognition, achieving state-of-the-art zero-shot and single-stage performance. |
foundation model |
|
|
| 4 |
Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion |
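Proposes MMHT, a model based on multimodal hybrid feature extraction and Transformer-based fusion, to improve the reliability of object tracking in complex scenes. |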
multimodal |
|
|
| 5 |
White-box Multimodal Jailbreaks Against Large Vision-Language Models |
Proposes a white-box multimodal jailbreak attack method to improve adversarial robustness evaluation of vision-language models |
multimodal |
|
|
| 6 |
XTrack: Multimodal Training Boosts RGB-X Video Object Trackers |
XTrack: multimodal training improves the performance of RGB-X video object trackers |
multimodal |
✅ |
|
| 7 |
MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance |
MMPareto: improves multimodal learning performance through harmless unimodal assistance |
multimodal |
✅ |
|
| 8 |
Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment |
Proposes DPA, a data-augmented phrase-level alignment method, to mitigate object hallucination in multimodal large language models. |
large language model multimodal |
|
|
| 9 |
Multi-modal Generation via Cross-Modal In-Context Learning |
Proposes MGCC, which uses cross-modal in-context learning to generate new images from multimodal prompt sequences. |
large language model multimodal |
✅ |
|
| 10 |
Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention |
Proposes the Intent3D dataset and the IntentNet model for 3D object detection in RGB-D scenes based on human intention. |
visual grounding |
✅ |
|
| 11 |
Text-only Synthesis for Image Captioning |
Proposes ToCa, which uses text-only synthesis for image captioning and significantly improves zero-shot generalization. |
large language model |
|
|
| 12 |
VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections |
VeLoRA: memory-efficient LLM training using rank-1 sub-token projections |
large language model |
|
|
| 13 |
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment |
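Proposes Contrastive Alignment (CAL), which uses visual correlation to distinguish the importance of text tokens and optimize vision-language models. |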
multimodal |
✅ |
|
| 14 |
MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation |
MMDisCo: uses a multimodal discriminator to guide cooperative diffusion for joint audio and video generation |
multimodal |
✅ |
|