| 1 |
Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events |
Uses multimodal large language models to automatically detect traffic safety-critical events |
large language model multimodal |
|
|
| 2 |
Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models |
Proposes a Theory of Mind (ToM) reasoning framework based on multimodal video large language models |
large language model multimodal |
|
|
| 3 |
MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency |
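MC-MKE: proposes a fine-grained multimodal knowledge editing benchmark that emphasizes modality consistency, for evaluating and correcting errors in MLLMs. |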
large language model multimodal |
|
|
| 4 |
Biomedical Visual Instruction Tuning with Clinician Preference Alignment |
BioMed-VITAL: biomedical visual instruction tuning through clinician preference alignment |
foundation model multimodal instruction following |
|
|
| 5 |
VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models |
Proposes VisualRWKV, which applies a linear RNN to visual language models for efficient multimodal learning. |
large language model multimodal |
✅ |
|
| 6 |
GUI Action Narrator: Where and When Did That Action Take Place? |
Proposes the GUI Narrator framework and the Act2Cap dataset to improve multimodal LLM performance on GUI action video understanding. |
multimodal |
|
|
| 7 |
IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning |
IntCoOp: an interpretable vision-language prompt tuning method that improves image-text alignment. |
zero-shot transfer |
|
|
| 8 |
SpatialBot: Precise Spatial Understanding with Vision Language Models |
SpatialBot: precise spatial understanding using vision language models |
embodied AI |
✅ |
|
| 9 |
Semantic Enhanced Few-shot Object Detection |
Proposes a semantic-enhanced few-shot object detection framework that improves detection performance on novel classes |
multimodal |
|
|
| 10 |
SituationalLLM: Proactive language models with scene awareness for dynamic, contextual task guidance |
SituationalLLM: proposes a proactive language model with scene awareness for dynamic, contextual task guidance. |
large language model |
|
|
| 11 |
Neural Residual Diffusion Models for Deep Scalable Vision Generation |
Proposes Neural Residual Diffusion Models (Neural-RDM) to address the scalability problem of deep vision generation models. |
large language model |
✅ |
|
| 12 |
Block-level Text Spotting with LLMs |
Proposes BTS-LLM, which uses large language models for block-level text localization and recognition in images. |
large language model |
|
|