| 1 |
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model |
Proposes the NOTA dataset and the NotaGPT model to improve visual large language models' understanding of music notation |
large language model multimodal |
✅ |
|
| 2 |
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection |
PRISM: a training-free, self-pruning method for multimodal data selection that addresses anisotropy in the visual feature distribution. |
large language model multimodal |
✅ |
|
| 3 |
Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning |
Proposes a Modular Visual Contrastive Decoding (MVCD) framework to improve LLMs' visual perception in multimodal reasoning. |
large language model multimodal |
✅ |
|
| 4 |
Token Communications: A Large Model-Driven Framework for Cross-modal Context-aware Semantic Communications |
Proposes the Token Communications framework, which uses large models to drive cross-modal, context-aware semantic communications. |
large language model foundation model multimodal |
|
|
| 5 |
Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics |
Builds a psychometric framework to evaluate the basic spatial abilities of visual language models |
embodied AI chain-of-thought |
|
|
| 6 |
Intuitive physics understanding emerges from self-supervised pretraining on natural videos |
Shows that intuitive physics understanding emerges in models through self-supervised pretraining on natural videos |
large language model multimodal |
|
|
| 7 |
Detecting Systematic Weaknesses in Vision Models along Predefined Human-Understandable Dimensions |
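Proposes an algorithm that combines foundation models with combinatorial search to detect systematic weaknesses in vision models along predefined dimensions. |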
foundation model |
|
|
| 8 |
Duo Streamers: A Streaming Gesture Recognition Framework |
Duo Streamers: a streaming gesture recognition framework for resource-constrained scenarios |
multimodal |
|
|