| 1 |
EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models |
EAGLE: Efficient comprehension of arbitrary referring visual prompts for multimodal large language models |
large language model multimodal instruction following |
|
|
| 2 |
Unveiling Ontological Commitment in Multi-Modal Foundation Models |
Proposes a method for extracting concept hierarchies from multimodal models, for use in validating and calibrating the models |
foundation model multimodal |
|
|
| 3 |
First Place Solution to the ECCV 2024 BRAVO Challenge: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation |
Uses the DINOv2 vision foundation model with a simple segmentation decoder to improve the robustness of semantic segmentation |
foundation model |
✅ |
|
| 4 |
Block Expanded DINORET: Adapting Natural Domain Foundation Models for Retinal Imaging Without Catastrophic Forgetting |
Proposes Block Expanded DINORET to address catastrophic forgetting when transferring natural-domain pretrained models to retinal imaging |
foundation model |
|
|
| 5 |
Targeted Neural Architectures in Multi-Objective Frameworks for Complete Glioma Characterization from Multimodal MRI |
Neural architectures targeted at multimodal MRI, within a multi-objective framework for complete glioma characterization |
multimodal |
|
|
| 6 |
ControlCity: A Multimodal Diffusion Model Based Approach for Accurate Geospatial Data Generation and Urban Morphology Analysis |
ControlCity: Generates accurate geospatial data and analyzes urban morphology using a multimodal diffusion model |
multimodal |
|
|
| 7 |
Robust Scene Change Detection Using Visual Foundation Models and Cross-Attention Mechanisms |
Proposes a robust scene change detection method based on DINOv2 and cross-attention |
foundation model |
✅ |
|
| 8 |
MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features |
MaViLS: A benchmark dataset for video-to-slide alignment, with a multimodal alignment algorithm |
multimodal |
|
|
| 9 |
Pix2Next: Leveraging Vision Foundation Models for RGB to NIR Image Translation |
Pix2Next: Uses vision foundation models to translate RGB images to near-infrared (NIR) images |
foundation model |
|
|
| 10 |
Underwater Camouflaged Object Tracking Meets Vision-Language SAM2 |
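Introduces UW-COT220, the first large-scale multimodal dataset for underwater camouflaged object tracking, and proposes VL-SAM2, a SAM2-based vision-language tracking framework |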
foundation model multimodal |
✅ |
|
| 11 |
ChatCam: Empowering Camera Control through Conversational AI |
ChatCam: Empowers camera control through conversational AI, emulating a professional cinematographer's workflow |
large language model |
|
|
| 12 |
Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation |
Proposes a fine-grained evaluation framework for vision-language navigation based on context-free grammar |
VLN |
|
|
| 13 |
Attention Prompting on Image for Large Vision-Language Models |
Proposes an attention prompting method on images that improves large vision-language models' ability to follow text instructions |
large language model |
|
|
| 14 |
DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling |
DALDA: Data augmentation with a diffusion model and an LLM, adaptively adjusting guidance scaling to improve few-shot learning performance |
large language model |
✅ |
|