| 1 |
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation |
Proposes LLaVA-Reward, which uses multimodal LLMs as customized reward models for text-to-image generation |
large language model multimodal instruction following |
|
|
| 2 |
Not Only Grey Matter: OmniBrain for Robust Multimodal Classification of Alzheimer's Disease |
OmniBrain: a unified framework for robust multimodal classification of Alzheimer's disease |
multimodal |
|
|
| 3 |
RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning |
Proposes RingMo-Agent for unified reasoning over multi-platform, multi-modal remote sensing imagery |
foundation model |
|
|
| 4 |
ATR-UMMIM: A Benchmark Dataset for UAV-Based Multimodal Image Registration under Complex Imaging Conditions |
ATR-UMMIM: a benchmark dataset for UAV-based multimodal image registration under complex imaging conditions |
multimodal |
✅ |
|
| 5 |
A Multimodal Architecture for Endpoint Position Prediction in Team-based Multiplayer Games |
Proposes a multimodal architecture for predicting players' future positions in team-based multiplayer games |
multimodal |
|
|
| 6 |
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset |
Proposes GPT-IMAGE-EDIT-1.5M, a large-scale image editing dataset intended to advance open-source research on instruction-guided image editing |
multimodal instruction following |
|
|
| 7 |
Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM |
Proposes security tensors, which extend text-aligned safety to the visual modality in LVLMs |
large language model multimodal |
|
|
| 8 |
T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation |
Proposes T2I-Copilot, a training-free multi-agent text-to-image system that improves prompt interpretation and interactive generation |
large language model multimodal |
✅ |
|
| 9 |
On Explaining Visual Captioning with Hybrid Markov Logic Networks |
Proposes an explanation framework for visual captioning based on hybrid Markov logic networks, improving model interpretability |
multimodal |
|
|
| 10 |
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models |
Proposes METEOR, which improves the efficiency of vision-language models through multi-encoder collaborative token pruning |
multimodal |
✅ |
|
| 11 |
KASportsFormer: Kinematic Anatomy Enhanced Transformer for 3D Human Pose Estimation on Short Sports Scene Video |
KASportsFormer: a kinematic-anatomy-enhanced Transformer for 3D human pose estimation in short sports-scene videos |
multimodal |
✅ |
|
| 12 |
TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model |
TransPrune: a token transition pruning method for efficient large vision-language models |
multimodal |
✅ |
|
| 13 |
T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval |
Proposes T2VParser, which uses adaptive decomposition tokens to achieve partial alignment in text-to-video retrieval |
multimodal |
✅ |
|