| 1 |
Are Multimodal Large Language Models Good Annotators for Image Tagging? |
Proposes the TagLLM framework to improve the annotation quality of multimodal large language models on image tagging tasks. |
large language model multimodal |
|
|
| 2 |
Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion |
Proposes MVLAD-AD, which uses a masked diffusion model to achieve efficient and explainable end-to-end autonomous driving. |
vision-language-action large language model |
|
|
| 3 |
CrystaL: Spontaneous Emergence of Visual Latents in MLLMs |
CrystaL: spontaneous emergence of visual latents in MLLMs, improving fine-grained visual understanding. |
large language model multimodal chain-of-thought |
|
|
| 4 |
OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation |
OrthoDiffusion: a generalizable multi-task diffusion model for musculoskeletal MRI interpretation. |
foundation model |
|
|
| 5 |
An interactive enhanced driving dataset for autonomous driving |
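Proposes IEDD, an interactive enhanced driving dataset that addresses data sparsity and insufficient multimodal alignment in VLA models for autonomous driving. |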
vision-language-action VLA multimodal |
|
|
| 6 |
UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics |
Proposes the UDVideoQA dataset for video question answering with multi-object spatio-temporal reasoning in urban traffic videos. |
multimodal visual grounding |
✅ |
|
| 7 |
OmniOCR: Generalist OCR for Ethnic Minority Languages |
OmniOCR: a generalist OCR framework for ethnic minority languages that improves recognition accuracy in low-resource scenarios. |
foundation model multimodal |
✅ |
|
| 8 |
Skullptor: High Fidelity 3D Head Reconstruction in Seconds with Multi-View Normal Prediction |
Skullptor: fast, high-fidelity 3D head reconstruction based on multi-view normal prediction. |
foundation model |
|
|
| 9 |
VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models |
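Proposes the VII framework, which bypasses the safety restrictions of image-to-video generation models through visual instruction injection. |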
instruction following |
|
|
| 10 |
Cycle-Consistent Tuning for Layered Image Decomposition |
Proposes a cycle-consistent fine-tuning method for diffusion-based layered image decomposition. |
foundation model |
|
|
| 11 |
On the Explainability of Vision-Language Models in Art History |
Studies the explainability of CLIP's visual reasoning in art history and evaluates the effectiveness of XAI methods. |
multimodal |
|
|