| 1 |
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models |
Proposes DICE to address the evaluation of image editing results |
distillation large language model multimodal |
|
|
| 2 |
From Data to Modeling: Fully Open-vocabulary Scene Graph Generation |
Proposes OvSGTR to address the open-vocabulary problem in traditional scene graph generation |
distillation open-vocabulary open vocabulary |
|
|
| 3 |
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought |
Proposes Vad-R1 to address video anomaly reasoning |
reinforcement learning large language model multimodal |
✅ |
|
| 4 |
MMGeoLM: Hard Negative Contrastive Learning for Fine-Grained Geometric Understanding in Large Multimodal Models |
Proposes MMGeoLM to address geometric understanding in large multimodal models |
contrastive learning multimodal |
✅ |
|
| 5 |
Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval |
Proposes a multimodal reasoning agent to address zero-shot composed image retrieval |
contrastive learning large language model multimodal |
|
|
| 6 |
FruitNeRF++: A Generalized Multi-Fruit Counting Method Utilizing Contrastive Learning and Neural Radiance Fields |
Proposes FruitNeRF++ to address counting of multiple fruit types |
contrastive learning neural radiance field foundation model |
|
|
| 7 |
Advancements in Medical Image Classification through Fine-Tuning Natural Domain Foundation Models |
Improves medical image classification performance by fine-tuning natural-domain foundation models |
Mamba MAE foundation model |
✅ |
|
| 8 |
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers |
Proposes ViTaPEs to address multimodal alignment |
representation learning multimodal |
|
|
| 9 |
ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving |
Proposes ReasonPlan to address decision reasoning in closed-loop autonomous driving |
imitation learning large language model multimodal |
✅ |
|
| 10 |
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration |
Proposes Omni-R1 to resolve the conflict between long video-audio reasoning and fine-grained pixel understanding |
reinforcement learning foundation model multimodal |
|
|
| 11 |
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities |
Proposes FUDOKI to address the limitations of multimodal large language models |
reinforcement learning flow matching large language model |
|
|
| 12 |
Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning |
Proposes Ground-R1 to address the supervision cost of visual reasoning |
reinforcement learning chain-of-thought |
|
|
| 13 |
Harnessing the Power of Training-Free Techniques in Text-to-2D Generation for Text-to-3D Generation via Score Distillation Sampling |
Proposes training-free techniques to improve text-to-3D generation quality |
distillation classifier-free guidance |
|
|
| 14 |
Long-Context State-Space Video World Models |
Proposes a long-context state-space video world model to address long-term memory |
world model SSM |
|
|
| 15 |
VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection |
Proposes the VisTA framework to address dynamic exploration in tool selection |
reinforcement learning |
|
|