| # | Title | Summary | Keywords | ✅ |
|---|-------|---------|----------|----|
| 1 | GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models | Proposes GAP-MLLM, which improves the 3D spatial perception of multimodal large language models via geometry-aligned pre-training. | large language model, multimodal, visual grounding | |
| 2 | VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents | VisBrowse-Bench: a benchmark for visual-native search by multimodal browsing agents. | large language model, multimodal | ✅ |
| 3 | When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition | Proposes the FrameRepeat framework, which mitigates the forgetting of visual information in video reasoning via frame repetition. | large language model, multimodal, chain-of-thought | |
| 4 | KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety | KidsNanny: a two-stage multimodal content moderation pipeline for child safety. | multimodal | |
| 5 | Fast-WAM: Do World Action Models Need Test-time Future Imagination? | Proposes Fast-WAM, which skips test-time future imagination to accelerate embodied control tasks. | vision-language-action, VLA | ✅ |
| 6 | Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation | Kestrel: a framework for mitigating LVLM hallucinations based on visual grounding and self-refinement. | multimodal, visual grounding | |
| 7 | MLLM-based Textual Explanations for Face Comparison | Analyzes the reliability of MLLM-generated explanations for face comparison, revealing their hallucination problems. | large language model, multimodal | ✅ |
| 8 | InViC: Intent-aware Visual Cues for Medical Visual Question Answering | Proposes the InViC framework, which improves image understanding in medical VQA via intent-aware visual cues. | large language model, multimodal | |
| 9 | 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method | Proposes Free360, a training-free 360° image VQA framework that improves MLLMs' understanding of panoramic images. | large language model, multimodal | |
| 10 | What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers | Uses ALiBi positional encoding to reduce positional bias in Vision Transformers, improving zero-shot transfer. | foundation model | |
| 11 | Retrieving Counterfactuals Improves Visual In-Context Learning | Proposes the CIRCLES framework, which improves visual in-context learning by retrieving counterfactual examples. | multimodal | ✅ |
| 12 | World Reconstruction From Inconsistent Views | Proposes a non-rigid alignment method for reconstructing a 3D world from inconsistent video frames. | foundation model | |
| 13 | BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection | Proposes BUSSARD, which uses normalizing flows for bijective, universal, scene-specific anomalous relationship detection. | multimodal | ✅ |
| 14 | VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations | Proposes the VIEW2SPACE benchmark for studying multi-view visual reasoning from sparse viewpoints, along with a Grounded Chain-of-Thought method. | chain-of-thought | |
| 15 | Cross-modal learning for plankton recognition | Proposes a plankton recognition method based on cross-modal self-supervised learning, combining images and optical measurement data to improve recognition accuracy. | multimodal | ✅ |
| 16 | Persistent Story World Simulation with Continuous Character Customization | EverTale: a persistent story-world simulator with continuous character customization, addressing character consistency and scene integration. | chain-of-thought | |
| 17 | Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models | Proposes EVPV, which improves the reliability of vision-language process reward models via explicit visual premise verification. | multimodal | ✅ |
| 18 | Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training | Proposes IOMM: efficient image-only pre-training for UMM visual generation via masked image modeling. | multimodal | ✅ |
| 19 | Boosting Quantitative and Spatial Awareness for Zero-Shot Object Counting | Proposes the QICA framework, which improves quantitative and spatial awareness in zero-shot object counting. | zero-shot transfer | |