| 1 |
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design |
Valley2: explores a multimodal model with a scalable vision-language design, improving performance in e-commerce and short-video scenarios. |
large language model multimodal |
✅ |
|
| 2 |
Text-to-Edit: Controllable End-to-End Video Ad Creation via Multimodal LLMs |
Proposes the Text-to-Edit framework, which uses multimodal LLMs for controllable end-to-end video ad creation. |
large language model multimodal |
|
|
| 3 |
Towards Iris Presentation Attack Detection with Foundation Models |
Uses pretrained models such as DinoV2 and VisualOpenClip to improve iris liveness (presentation attack) detection performance. |
foundation model |
|
|
| 4 |
BRIGHT: A globally distributed multimodal building damage assessment dataset with very-high-resolution for all-weather disaster response |
Proposes BRIGHT, a multimodal building damage assessment dataset for training AI models for all-weather disaster response. |
multimodal |
✅ |
|
| 5 |
A Multimodal Dataset for Enhancing Industrial Task Monitoring and Engagement Prediction |
Proposes MIAM, a multimodal dataset for improving industrial task monitoring and behavior prediction in human-machine collaboration. |
multimodal |
✅ |
|
| 6 |
PEACE: Empowering Geologic Map Holistic Understanding with MLLMs |
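Proposes GeoMap-Agent, which enables multimodal large language models to understand geologic maps and improves the efficiency of geological surveys. |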
large language model multimodal |
|
|
| 7 |
Scalable Vision Language Model Training via High Quality Data Curation |
SAIL-VL: scalable vision-language model training through high-quality data curation. |
multimodal |
✅ |
|
| 8 |
VideoRAG: Retrieval-Augmented Generation over Video Corpus |
Proposes VideoRAG, which improves question-answering accuracy over video corpora through retrieval-augmented generation. |
multimodal |
✅ |
|