| 1 |
TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning |
Proposes the TangramPuzzle benchmark to evaluate the compositional spatial reasoning abilities of multimodal large language models. |
large language model multimodal |
|
|
| 2 |
OnlineSI: Taming Large Language Model for Online 3D Understanding and Grounding |
Proposes the OnlineSI framework, which uses large language models for continuous online 3D scene understanding and grounding |
large language model multimodal |
|
|
| 3 |
Emotion-LLaMAv2 and MMEVerse: A New Framework and Benchmark for Multimodal Emotion Understanding |
Emotion-LLaMAv2: an end-to-end framework and benchmark for multimodal emotion understanding |
large language model multimodal |
|
|
| 4 |
VISTA-PATH: An interactive foundation model for pathology image segmentation and quantitative analysis in computational pathology |
VISTA-PATH: an interactive foundation model for pathology image segmentation and quantitative analysis |
foundation model |
✅ |
|
| 5 |
Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos |
Uses glitches in gameplay videos to build the physical-world understanding dataset PhysGame and the benchmark GameBench. |
large language model multimodal |
|
|
| 6 |
ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation |
ResAgent: proposes entropy-based prior point discovery and visual reasoning for referring expression segmentation. |
large language model multimodal |
|
|
| 7 |
X-Aligner: Composed Visual Retrieval without the Bells and Whistles |
Proposes X-Aligner for composed video retrieval, reaching SOTA without complex design |
multimodal |
|
|