| 1 |
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation |
Proposes FantasyVLN for unified multimodal chain-of-thought reasoning in vision-language navigation, improving efficiency and performance. |
VLA VLN multimodal |
|
|
| 2 |
Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model |
Proposes MM-OOD to address out-of-distribution (OOD) detection in image space. |
large language model multimodal |
|
|
| 3 |
LLM Augmented Intervenable Multimodal Adaptor for Post-operative Complication Prediction in Lung Cancer Surgery |
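MIRACLE: an intervenable, LLM-augmented multimodal adaptor that fuses clinical and imaging data to predict post-operative complications in lung cancer surgery. |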
large language model multimodal |
|
|
| 4 |
Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology |
Proposes LoCo-RFT to address logical inconsistency in multimodal reasoning for meteorology, and builds the Weather-R1 model. |
multimodal |
✅ |
|
| 5 |
Scaling Test-time Inference for Visual Grounding |
Proposes EGM, which improves the performance and efficiency of small visual grounding models by scaling test-time compute. |
visual grounding |
|
|
| 6 |
The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning |
Proposes MIR-SafetyBench, revealing the safety risks that accompany stronger multi-image reasoning in multimodal large language models. |
large language model multimodal |
✅ |
|
| 7 |
Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration |
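Proposes the CVSI model, which achieves fine-grained zero-shot composed image retrieval through complementary visual-semantic integration. |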
large language model multimodal |
✅ |
|
| 8 |
XD-MAP: Cross-Modal Domain Adaptation using Semantic Parametric Mapping |
Proposes XD-MAP, which uses semantic parametric mapping for cross-modal domain adaptation from images to LiDAR. |
foundation model |
|
|
| 9 |
OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer |
OmniTransfer: a unified framework for spatio-temporal video transfer that improves the flexibility and fidelity of video generation. |
multimodal |
|
|
| 10 |
OCCAM: Class-Agnostic, Training-Free, Prior-Free and Multi-Class Object Counting |
Proposes OCCAM, a training-free, prior-free, class-agnostic method for multi-class object counting. |
foundation model |
✅ |
|
| 11 |
Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders |
Insight: builds interpretable semantic hierarchies in vision-language encoders. |
foundation model |
✅ |
|
| 12 |
HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection |
Proposes History-Injection Transformers (HiT) for continuous flood change detection onboard satellites, enabling real-time disaster assessment. |
foundation model |
✅ |
|
| 13 |
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search |
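Proposes the HAVEN framework, which achieves hierarchical long video understanding through audiovisual entity cohesion and agentic search. |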
multimodal |
|
|
| 14 |
Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles |
Uses visual puzzles to probe the reasoning ability of large vision-language models, revealing their pattern-matching limitations. |
multimodal |
|
|
| 15 |
VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement |
Proposes VIAFormer for high-fidelity voxel refinement guided by multi-view images. |
foundation model |
|
|
| 16 |
Face-Voice Association with Inductive Bias for Maximum Class Separation |
Proposes a face-voice association method that uses an inductive bias for maximum class separation. |
multimodal |
|
|
| 17 |
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch |
ChartVerse: scales chart reasoning through reliable programmatic synthesis from scratch. |
chain-of-thought |
|
|