| 1 |
LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models |
Proposes LoC-Path, which improves the efficiency of pathology multimodal large language models by compressing redundant information. |
MAE large language model multimodal |
|
|
| 2 |
SpectraIrisPAD: Leveraging Vision Foundation Models for Spectrally Conditioned Multispectral Iris Presentation Attack Detection |
SpectraIrisPAD: uses vision foundation models for spectrally conditioned multispectral iris presentation attack detection. |
contrastive learning foundation model |
|
|
| 3 |
DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis |
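Proposes DashFusion, which addresses the alignment and fusion problems in multimodal sentiment analysis through dual-stream alignment and hierarchical bottleneck fusion. |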
contrastive learning multimodal |
✅ |
|
| 4 |
ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction |
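ParaUni: enhances the generation capability of unified multimodal models through reinforcement-learning-driven hierarchical parallel information interaction. |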
reinforcement learning multimodal |
✅ |
|
| 5 |
Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision |
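Proposes a privacy-preserving knowledge distillation framework for training local surgical VLMs to explain anatomical structures in complete mesocolic excision. |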
DPO direct preference optimization scene understanding |
|
|
| 6 |
EditThinker: Unlocking Iterative Reasoning for Any Image Editor |
EditThinker: unlocks iterative reasoning for any image editor, improving instruction following. |
reinforcement learning foundation model instruction following |
|
|
| 7 |
World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty |
Proposes the C3 method for training controllable video generation models with calibrated uncertainty estimation. |
world model |
|
|
| 8 |
Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling |
Proposes the ViSA framework, which uses spatial assertions to improve the test-time scaling of world models for spatial reasoning. |
world model |
✅ |
|
| 9 |
LeAD-M3D: Leveraging Asymmetric Distillation for Real-time Monocular 3D Detection |
LeAD-M3D: uses asymmetric distillation for real-time monocular 3D object detection. |
distillation |
|
|
| 10 |
Training Multi-Image Vision Agents via End2End Reinforcement Learning |
Proposes IMAgent, a multi-image vision agent trained with end-to-end reinforcement learning to solve complex multi-image QA tasks. |
reinforcement learning |
|
|
| 11 |
Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient Paradigm |
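Proposes an efficient framework for infrared small target detection built on vision foundation models, significantly improving detection accuracy. |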
distillation foundation model |
✅ |
|