| 1 | Automating Video Thumbnails Selection and Generation with Multimodal and Multistage Analysis | Proposes a multimodal, multistage analysis method that automatically selects and generates high-quality video thumbnails. | large language model multimodal | |
| 2 | Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning | Proposes Swiss Army Knife, which fuses the knowledge biases of vision foundation models to improve multi-task learning performance. | foundation model | ✅ |
| 3 | ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs | ViCToR: improves the visual comprehension of LMMs through visual token reconstruction. | large language model multimodal | ✅ |
| 4 | Toward Generalizing Visual Brain Decoding to Unseen Subjects | Proposes a general visual brain decoding framework that improves generalization to unseen subjects. | foundation model | ✅ |
| 5 | Vision-Language Navigation with Energy-Based Policy | Proposes an energy-based navigation policy for vision-language navigation. | VLN | |
| 6 | Storyboard guided Alignment for Fine-grained Video Action Recognition | Proposes a storyboard-guided alignment method for fine-grained video action recognition. | large language model | |
| 7 | Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment | Proposes FiSAO, which uses the vision encoder for token-level feedback to improve vision-language model alignment. | large language model | |
| 8 | ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom | ProReason: decouples visual perception from textual reasoning to enable proactive multimodal reasoning. | large language model | |
|