| 1 |
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models |
Vision-DeepResearch: improves multimodal large language models on complex visual tasks through multi-turn, multi-entity, multi-scale search. |
large language model foundation model multimodal |
✅ |
|
| 2 |
RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning |
Proposes RSGround-R1 to address spatial reasoning in remote sensing visual grounding.
large language model multimodal visual grounding |
|
|
| 3 |
Thinker: A vision-language foundation model for embodied intelligence |
Thinker: a vision-language foundation model for embodied intelligence that tackles robot perception and reasoning challenges.
foundation model visual grounding chain-of-thought |
|
|
| 4 |
UEval: A Benchmark for Unified Multimodal Generation |
UEval: a benchmark for evaluating unified multimodal generation models.
large language model multimodal |
|
|
| 5 |
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods |
MMFineReason: closes the multimodal reasoning gap through open, data-centric methods.
multimodal chain-of-thought |
|
|
| 6 |
CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models |
Proposes CG-MLLM to address the low-resolution problem in 3D content generation.
large language model multimodal |
|
|
| 7 |
MultiModal Fine-tuning with Synthetic Captions |
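Proposes a multimodal fine-tuning method based on synthetic captions generated by multimodal large language models, improving image classification performance.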
large language model multimodal |
✅ |
|
| 8 |
Understanding Multimodal Complementarity for Single-Frame Action Anticipation |
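Proposes AAG+, a single-frame action anticipation framework that fuses multimodal information and performs on par with video-based methods.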
multimodal |
|
|
| 9 |
VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models |
VideoAesBench: a comprehensive benchmark for evaluating the video aesthetics perception capabilities of large multimodal models.
multimodal |
|
|
| 10 |
When Gradient Optimization Is Not Enough: Dispersive and Anchoring Geometric Regularizer for Multimodal Learning
Proposes the Dispersive and Anchoring Geometric Regularizer to address pathological geometric structure in multimodal learning.
multimodal |
|
|
| 11 |
Hypernetwork-Based Adaptive Aggregation for Multimodal Multiple-Instance Learning in Predicting Coronary Calcium Debulking |
Proposes a hypernetwork-based adaptive aggregation Transformer for predicting the need for coronary calcium debulking.
multimodal |
✅ |
|
| 12 |
Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation |
UniMRG: enhances the understanding capability of unified multimodal models through multi-representation generation.
multimodal |
|
|
| 13 |
Do Pathology Foundation Models Encode Disease Progression? A Pseudotime Analysis of Visual Representations |
Pseudotime analysis in representation space indicates that pathology pretrained models encode disease progression.
foundation model |
|
|
| 14 |
ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing |
Proposes the ChartE$^{3}$ benchmark for comprehensive evaluation and capability improvement in end-to-end chart editing.
large language model multimodal |
|
|
| 15 |
LAMP: Learning Universal Adversarial Perturbations for Multi-Image Tasks via Pre-trained Models |
LAMP: learns universal adversarial perturbations for multi-image tasks via pre-trained models.
large language model multimodal |
|
|
| 16 |
Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation |
Proposes DGNav to address the granularity rigidity of topological maps in vision-language navigation, improving navigation performance.
VLN |
✅ |
|
| 17 |
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models |
Proposes OCRVerse for holistic OCR in end-to-end vision-language models, handling text and visual elements in a unified way.
multimodal |
|
|
| 18 |
Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention |
Spava: accelerates long-video understanding via sequence-parallelism-aware approximate attention.
multimodal |
✅ |
|
| 19 |
MPF-Net: Exposing High-Fidelity AI-Generated Video Forgeries via Hierarchical Manifold Deviation and Micro-Temporal Fluctuations |
MPF-Net: exposes high-fidelity AI-generated video forgeries via hierarchical manifold deviation and micro-temporal fluctuations.
foundation model |
|
|