| 1 |
Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners |
Studies how the contrastive captioner CoCa adapts to few-shot learning and proposes optimization strategies. |
foundation model multimodal zero-shot transfer |
|
|
| 2 |
Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation |
Proposes a vision-enhanced LLM framework for high-resolution image synthesis and multimodal data understanding |
large language model multimodal |
|
|
| 3 |
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space |
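Proposes the DMLR framework, which improves the reasoning and perception abilities of MLLMs through dynamic multimodal reasoning in latent space |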
large language model multimodal chain-of-thought |
|
|
| 4 |
DL$^3$M: A Vision-to-Language Framework for Expert-Level Medical Reasoning through Deep Learning and Large Language Models |
DL$^3$M: a vision-to-language framework that combines deep learning with large language models for expert-level medical reasoning |
large language model |
✅ |
|
| 5 |
Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding |
Lemon: a unified, scalable 3D multimodal model for universal spatial understanding |
multimodal |
|
|
| 6 |
DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning |
DrivePI: a spatial-aware 4D MLLM for unified autonomous driving understanding, perception, prediction, and planning |
vision-language-action VLA large language model |
✅ |
|
| 7 |
CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence |
Proposes CoRe3D to address insufficient reasoning in 3D intelligence |
multimodal chain-of-thought |
|
|
| 8 |
FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning |
FysicsWorld: the first unified full-modality benchmark, supporting understanding, generation, and reasoning between arbitrary modalities. |
large language model multimodal |
|
|
| 9 |
Efficient Vision-Language Reasoning via Adaptive Token Pruning |
Proposes Adaptive Token Pruning (ATP) to efficiently accelerate inference in vision-language models. |
multimodal visual grounding |
|
|
| 10 |
Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline |
Proposes CMER-Bench, a large-scale dataset, and CMERNet to improve complex mathematical expression recognition |
large language model multimodal |
|
|
| 11 |
StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding |
StreamingAssistant: efficient visual token pruning to accelerate online video understanding |
large language model multimodal |
|
|
| 12 |
SignRAG: A Retrieval-Augmented System for Scalable Zero-Shot Road Sign Recognition |
SignRAG: a scalable retrieval-augmented system for zero-shot road sign recognition |
large language model |
|
|
| 13 |
JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation |
Proposes the JointAVBench benchmark for evaluating the joint audio-visual reasoning ability of Omni-LLMs. |
large language model |
|
|