| 1 |
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning |
Proposes UniVG-R1, which uses reinforcement learning to strengthen reasoning for universal visual grounding. |
reinforcement learning large language model multimodal |
✅ |
|
| 2 |
UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation |
UniGen: unified multimodal understanding and generation through enhanced training and test-time strategies |
direct preference optimization large language model multimodal |
|
|
| 3 |
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning |
Visionary-R1: mitigating shortcut learning in visual reasoning with reinforcement learning |
reinforcement learning large language model multimodal |
|
|
| 4 |
Programmatic Video Prediction Using Large Language Models |
ProgGen: interpretable programmatic video prediction using large language models |
world model large language model |
|
|
| 5 |
Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency |
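Proposes the TemRobBench benchmark and the PanoDPO optimization method to improve the robustness of large models under temporal-consistency perturbations. |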
direct preference optimization multimodal |
|
|
| 6 |
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank |
Proposes VisualQuality-R1, which performs reasoning-driven image quality assessment via reinforcement learning to rank. |
reinforcement learning large language model |
|
|
| 7 |
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning |
DeepEyes: using reinforcement learning to incentivize vision-language models to "think with images" |
reinforcement learning multimodal |
✅ |
|
| 8 |
Towards Omnidirectional Reasoning with 360-R1: A Dataset, Benchmark, and GRPO-based Method |
Proposes the OmniVQA dataset and the 360-R1 method to improve panoramic visual question answering |
reinforcement learning embodied AI large language model |
|
|
| 9 |
StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning |
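Proposes the StPR framework, which addresses exemplar-free video class-incremental learning by decoupling and preserving spatiotemporal information. |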
distillation spatiotemporal |
|
|
| 10 |
Intra-class Patch Swap for Self-Distillation |
Proposes a self-distillation method based on intra-class patch swapping that improves model performance without a teacher network. |
teacher-student distillation |
✅ |
|
| 11 |
MultiMAE Meets Earth Observation: Pre-training Multi-modal Multi-task Masked Autoencoders for Earth Observation Tasks |
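Proposes a MultiMAE pre-training approach for Earth observation that improves downstream task performance on multimodal remote sensing data. |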
masked autoencoder |
✅ |
|
| 12 |
RETRO: REthinking Tactile Representation Learning with Material PriOrs |
Proposes material-aware priors to improve the accuracy of tactile representation learning |
representation learning |
|
|
| 13 |
Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search |
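Proposes the Symbolic Graph Ranker (SGR), which uses LLMs to unify graph learning with textual information and improve session search performance. |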
contrastive learning large language model |
|
|
| 14 |
Scaling Vision Mamba Across Resolutions via Fractal Traversal |
FractalMamba++: proposes a Vision Mamba based on fractal traversal that improves adaptability across resolutions |
Mamba |
|
|
| 15 |
Physics-Driven Local-Whole Elastic Deformation Modeling for Point Cloud Representation Learning |
Proposes physics-driven local-whole elastic deformation modeling to improve point cloud representation learning |
representation learning |
|
|
| 16 |
Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels |
Ground-V: improves VLM grounding in complex scenarios through pixel-level instruction tuning |
distillation instruction following |
|
|