| 1 | Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models | Proposes Delta-LLaVA, a multimodal large language model framework that unifies remote sensing change detection and understanding. | large language model multimodal | | |
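| 2 | Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding | Proposes the UniRect-CoT framework, which uses the inherent understanding ability of unified multimodal models to improve generation quality. | multimodal chain-of-thought | | |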
| 3 | Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning | Proposes the FiMR framework, which enhances text-to-image generation through fine-grained multimodal reasoning. | large language model multimodal | | |
| 4 | Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks | Reveals why multimodal in-context learning lags behind and analyzes its inner mechanisms and bottlenecks. | large language model multimodal | ✅ | |
| 5 | POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch | Proposes POINTS-Seeker, a multimodal agentic search model trained from scratch, to tackle long-horizon, knowledge-intensive visual reasoning. | multimodal | | |
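| 6 | A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy | Proposes a coarse-to-fine registration framework that fuses multimodal clinical information for longitudinal CT registration in proton therapy. | multimodal | | |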
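| 7 | ROSE: Retrieval-Oriented Segmentation Enhancement | Proposes the ROSE framework, which uses retrieval augmentation to address the knowledge gaps of multimodal large language models when segmenting emerging entities. | large language model multimodal | | |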
| 8 | Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios | Proposes the DailyClue benchmark, which evaluates how well MLLMs reason from visual clues in daily scenarios. | large language model multimodal | | |
| 9 | SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs | Proposes SLQ, which bridges modalities via shared latent queries to enable retrieval with frozen MLLMs. | large language model multimodal | | |
| 10 | One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding | Proposes LP-Comp and QC-Comp, which achieve extreme compression for long video understanding and improve VLM performance. | large language model | | |
| 11 | Training-Free Semantic Multi-Object Tracking with Vision-Language Models | Proposes TF-SMOT, a training-free semantic multi-object tracking framework that improves video understanding. | foundation model | | |
| 12 | Context Sensitivity Improves Human-Machine Visual Alignment | Proposes a context-sensitive similarity computation method that improves human-machine visual alignment. | foundation model | | |
| 13 | Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning | Proposes a dynamic token selection and fine-tuning method for efficient multi-view 3D object detection. | foundation model | | |