| 1 |
DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models |
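Proposes the DR$^2$Seg framework, which improves both the efficiency and accuracy of multimodal large language models on reasoning segmentation. |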
large language model multimodal |
|
|
| 2 |
Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer |
Proposes a missing-aware multimodal survival prediction framework that addresses missing data in non-small cell lung cancer. |
foundation model multimodal |
|
|
| 3 |
ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding |
ROMA: a real-time omni-multimodal assistant for interactive streaming understanding |
large language model multimodal |
|
|
| 4 |
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection |
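Proposes a generalizable end-to-end autonomous driving method based on stochastic patch selection, improving generalization and efficiency. |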
foundation model |
|
|
| 5 |
Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models |
Proposes HRA, a hierarchically refined universal multimodal attack framework, to improve the robustness of vision-language models |
multimodal |
|
|
| 6 |
Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method |
Proposes a video anomaly reasoning task and dataset, and designs Vad-R1-Plus, an adaptive multi-stage reasoning model |
large language model multimodal chain-of-thought |
|
|
| 7 |
V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation |
V-Zero: a self-improving multimodal reasoning framework based on unannotated data |
multimodal |
✅ |
|
| 8 |
VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models |
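Proposes the VERHallu evaluation benchmark and designs the KFP strategy to mitigate event relation hallucination in video large language models |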
large language model |
|
|
| 9 |
Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs |
Proposes a fine-grained human pose editing assessment method based on layer-selective multimodal large language models |
large language model multimodal |
|
|