| 1 |
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding |
LLaVA-ST: a multimodal large language model for fine-grained spatial-temporal understanding |
large language model multimodal |
✅ |
|
| 2 |
Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models |
Proposes Moment-GPT, which uses frozen multimodal large language models for zero-shot video moment retrieval. |
large language model multimodal |
|
|
| 3 |
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding |
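Proposes Parameter-Inverted Image Pyramid Networks (PIIP), which improve visual perception and multimodal understanding performance at low computational cost. |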
large language model foundation model multimodal |
✅ |
|
| 4 |
Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers |
FUTURIST: proposes a semantic future prediction method based on a multimodal visual sequence Transformer |
multimodal |
|
|
| 5 |
Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features |
Builds a multimodal image analysis benchmark to evaluate how well models understand fine-grained visual features |
multimodal |
|
|
| 6 |
Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving |
Uses vision foundation models for anomaly detection in input monitoring for autonomous driving |
foundation model |
|
|
| 7 |
Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness |
Proposes FaceTrack-MM and FEC-Bench to improve video MLLMs' perception of dynamic facial expressions and their contextual understanding |
large language model multimodal instruction following |
|
|
| 8 |
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks |
Omni-RGPT: unifies region-level understanding of images and videos via Token Mark |
large language model multimodal |
|
|
| 9 |
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models |
Vchitect-2.0: a parallel Transformer architecture that scales up video diffusion models for large-scale text-to-video generation. |
multimodal |
|
|