| 1 |
Forgotten Polygons: Multimodal Large Language Models are Shape-Blind |
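Reveals the shortcomings of multimodal large language models in recognizing geometric shapes and proposes a visually cued chain-of-thought prompting method. |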
large language model multimodal chain-of-thought |
✅ |
|
| 2 |
Multi-Agent Multimodal Models for Multicultural Text to Image Generation |
Proposes MosAIG, a multi-agent framework that improves multicultural text-to-image generation. |
large language model multimodal |
✅ |
|
| 3 |
M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment |
M3-AGIQA: multimodal, multi-round, multi-aspect quality assessment of AI-generated images |
large language model multimodal |
✅ |
|
| 4 |
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval |
ELIP: enhances vision-language foundation models to improve image retrieval performance |
foundation model |
|
|
| 5 |
LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models |
Proposes the LongCaption-Agent framework to address the scarcity of long-caption annotations for long video caption generation in large models |
multimodal |
|
|
| 6 |
M2LADS Demo: A System for Generating Multimodal Learning Analytics Dashboards |
M2LADS: a system for generating multimodal learning analytics dashboards |
multimodal |
|
|
| 7 |
Fish feeding behavior recognition and intensity quantification methods in aquaculture: From single modality analysis to multimodality fusion |
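Reviews methods for recognizing fish feeding behavior and quantifying feeding intensity in aquaculture, from single-modality analysis to multimodal fusion |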
multimodal |
|
|
| 8 |
Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs |
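Proposes a multimodal large language model based on memory correction that improves its performance in understanding events in streaming video. |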
large language model multimodal |
|
|
| 9 |
MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing |
Proposes MOVE, a mixture-of-vision-encoders approach for domain-focused vision-language processing |
large language model multimodal |
|
|
| 10 |
WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents |
WorldCraft: photo-realistic 3D world creation and customization using LLM agents |
large language model |
|
|
| 11 |
AutoMR: A Universal Time Series Motion Recognition Pipeline |
AutoMR: a universal time-series motion recognition pipeline that addresses the challenges of processing multimodal data. |
multimodal |
|
|