| 1 |
MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models |
MME-Unify: a comprehensive evaluation benchmark for unified multimodal understanding and generation models. |
multimodal |
✅ |
|
| 2 |
Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal |
Proposes DB-CR: a multimodal diffusion bridge method with attention-based SAR fusion for satellite image cloud removal |
multimodal |
|
|
| 3 |
RANa: Retrieval-Augmented Navigation |
Proposes RANa: a retrieval-augmented navigation method that uses past experience to improve robot navigation performance. |
foundation model zero-shot transfer |
|
|
| 4 |
VISTA-OCR: Towards generative and interactive end to end OCR models |
Proposes VISTA-OCR, a generative, interactive end-to-end OCR model that unifies text detection and recognition. |
large language model multimodal |
|
|
| 5 |
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models |
VideoComp: improves the fine-grained compositional and temporal alignment capabilities of video-text models |
multimodal |
|
|
| 6 |
Can ChatGPT Learn My Life From a Week of First-Person Video? |
Uses first-person video to explore ChatGPT's ability to learn information about a person's life |
foundation model |
|
|
| 7 |
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use |
ScreenSpot-Pro: a GUI grounding benchmark for professional high-resolution computer use, together with the ScreenSeekeR method |
large language model |
✅ |
|
| 8 |
Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models |
Studies the robustness of uncertainty estimation in vision-language models under image corruption |
large language model |
|
|
| 9 |
TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference |
TokenFLEX: proposes a unified VLM training framework that enables inference with a flexible number of visual tokens. |
large language model |
|
|