| 1 |
Lost in Time: Clock and Calendar Understanding Challenges in Multimodal LLMs |
Reveals the challenges multimodal large language models face in understanding clocks and calendars |
large language model, multimodal |
|
|
| 2 |
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation |
QLIP: text-aligned visual tokenization unifies auto-regressive multimodal understanding and generation |
multimodal |
|
|
| 3 |
Chest X-ray Foundation Model with Global and Local Representations Integration |
CheXFound: a chest X-ray foundation model that integrates global and local representations |
foundation model |
✅ |
|
| 4 |
Goku: Flow Based Video Generative Foundation Models |
Goku: flow-based video generative foundation models that achieve industry-leading joint image and video generation performance |
foundation model |
|
|
| 5 |
Survey on AI-Generated Media Detection: From Non-MLLM to MLLM |
Surveys AI-generated media detection techniques: the evolution and challenges from non-MLLM to MLLM methods |
large language model, multimodal |
|
|
| 6 |
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy |
Long-VITA: a multimodal model that supports million-token long contexts while preserving short-context accuracy |
large language model |
|
|
| 7 |
Multitwine: Multi-Object Compositing with Text and Layout Control |
Multitwine: the first multi-object compositing generative model with text and layout control |
multimodal |
|
|
| 8 |
ELITE: Enhanced Language-Image Toxicity Evaluation for Safety |
Proposes the ELITE benchmark and evaluator to improve the quality and diversity of safety evaluation for vision-language models |
multimodal |
|
|