| 1 |
RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models |
Proposes RestoreAgent, which uses multimodal large language models to perform autonomous image restoration and handle complex degradations |
large language model multimodal |
|
|
| 2 |
Efficient Inference of Vision Instruction-Following Models with Elastic Cache |
Proposes Elastic Cache to accelerate inference of vision instruction-following models and reduce KV cache memory requirements |
multimodal instruction following |
✅ |
|
| 3 |
Retinal IPA: Iterative KeyPoints Alignment for Multimodal Retinal Imaging |
Proposes Retinal IPA, a keypoint alignment method for multimodal retinal image registration |
multimodal |
✅ |
|
| 4 |
Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis |
Studies the robustness of multimodal models under sparse and contiguous adversarial pixel perturbations |
multimodal |
|
|
| 5 |
KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models |
Proposes the KiVA benchmark to test the visual analogical reasoning ability of large multimodal models |
multimodal |
|
|
| 6 |
ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation |
ERIT: a lightweight multimodal dataset for elderly emotion recognition and multimodal fusion evaluation |
multimodal |
|
|
| 7 |
Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning |
Proposes Bottleneck Adapter to improve the performance of vision-language instruction-tuned models |
large language model multimodal |
|
|
| 8 |
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation |
RefMask3D: a language-guided Transformer network for 3D referring expression segmentation |
visual grounding |
✅ |
|
| 9 |
MARINE: A Computer Vision Model for Detecting Rare Predator-Prey Interactions in Animal Videos |
MARINE: a computer vision model for detecting rare predator-prey interactions in animal videos |
foundation model |
|
|
| 10 |
Unified Lexical Representation for Interpretable Visual-Language Alignment |
Proposes LexVLA, which achieves interpretable vision-language alignment through a unified lexical representation |
VLA |
✅ |
|
| 11 |
A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models |
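Proposes the feature-guided attack FGA and its improved variant FGA-T for evaluating and improving the robustness of vision-language pre-training models |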
multimodal |
|
|
| 12 |
DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction |
Proposes the DAC framework, which addresses 2D-3D cross-modal retrieval with noisy labels through divide-and-conquer alignment and correction |
multimodal |
|
|