| 1 |
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models |
Proposes DICE, which uses multimodal large language models to evaluate the results of instruction-guided image editing |
distillation large language model multimodal |
|
|
| 2 |
From Data to Modeling: Fully Open-vocabulary Scene Graph Generation |
Proposes OvSGTR for fully open-vocabulary scene graph generation, moving beyond the traditional closed-set limitation. |
distillation open-vocabulary open vocabulary |
|
|
| 3 |
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought |
Proposes Vad-R1, which performs video anomaly reasoning through a perception-to-cognition chain of thought |
reinforcement learning large language model multimodal |
✅ |
|
| 4 |
MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models |
MMGeoLM: improves large models' fine-grained understanding of geometric scenes through hard negative contrastive learning |
contrastive learning multimodal |
✅ |
|
| 5 |
Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval |
Proposes a multimodal reasoning agent that addresses error propagation in zero-shot composed image retrieval |
contrastive learning large language model multimodal |
|
|
| 6 |
FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields |
FruitNeRF++: generalized multi-fruit counting using contrastive learning and neural radiance fields |
contrastive learning neural radiance field foundation model |
|
|
| 7 |
Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models |
Improves medical image classification performance by fine-tuning models pre-trained on natural-domain images |
Mamba MAE foundation model |
✅ |
|
| 8 |
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers |
ViTaPEs: visuotactile position encodings for aligning vision and touch in multimodal Transformers |
representation learning multimodal |
|
|
| 9 |
ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving |
ReasonPlan: a unified scene prediction and decision reasoning framework for closed-loop autonomous driving |
imitation learning large language model multimodal |
✅ |
|
| 10 |
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration |
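Omni-R1: proposes a reinforcement-learning-based two-system collaboration framework that resolves the conflict between long-horizon and pixel-level understanding in omnimodal reasoning. |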
reinforcement learning foundation model multimodal |
|
|
| 11 |
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities |
Proposes FUDOKI, a unified multimodal model based on discrete flow matching, for visual understanding and image generation. |
reinforcement learning flow matching large language model |
|
|
| 12 |
Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning |
Proposes Ground-R1, which uses reinforcement learning to incentivize interpretable visual reasoning without additional annotations. |
reinforcement learning chain-of-thought |
|
|
| 13 |
Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling |
Explores applying training-free techniques to SDS-based text-to-3D generation to improve generation quality. |
distillation classifier-free guidance |
|
|
| 14 |
Long-Context State-Space Video World Models |
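Proposes a long-context video world model based on state-space models, addressing the long-range dependency problem of video diffusion models. |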
world model SSM |
|
|
| 15 |
VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection |
VisTA: a reinforcement-learning framework for dynamically selecting visual tools, improving visual reasoning |
reinforcement learning |
|
|