| 1 |
Look Carefully: Adaptive Visual Reinforcements in Multimodal Large Language Models for Hallucination Mitigation |
Proposes AIR, an adaptive visual reinforcement framework that mitigates hallucination in multimodal large language models |
large language model multimodal |
|
|
| 2 |
GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models |
GuardAlign: a safety defense framework for multimodal large language models based on test-time alignment |
large language model multimodal |
|
|
| 3 |
Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping |
Venus: improves the aesthetic guidance and cropping capabilities of multimodal large language models |
large language model multimodal |
✅ |
|
| 4 |
Multimodal Optimal Transport for Unsupervised Temporal Segmentation in Surgical Robotics |
Proposes TASOT, which uses multimodal optimal transport for unsupervised temporal segmentation of surgical robotics videos |
multimodal zero-shot transfer |
✅ |
|
| 5 |
A multimodal slice discovery framework for systematic failure detection and explanation in medical image classification |
Proposes a multimodal slice discovery framework for detecting and explaining systematic failures in medical image classification |
multimodal |
|
|
| 6 |
PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning |
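PointCoT: proposes a multimodal benchmark for 3D geometric reasoning that addresses the geometric hallucinations of MLLMs in point cloud understanding. |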
large language model multimodal chain-of-thought |
|
|
| 7 |
Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation using Foundation Models |
Proposes a robust few-shot 3D vessel segmentation method built on pretrained models that copes with data scarcity and domain shift. |
foundation model |
|
|
| 8 |
Any Model, Any Place, Any Time: Get Remote Sensing Foundation Model Embeddings On Demand |
rs-embed: on-demand access to remote sensing foundation model embeddings, addressing the heterogeneity problem |
foundation model |
✅ |
|
| 9 |
Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering |
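Proposes SDLS, which suppresses prior-comparison hallucinations in radiology report generation through semantically decoupled latent steering. |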
large language model foundation model zero-shot transfer |
|
|
| 10 |
Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities? |
VGUBench reveals the shortcomings of unified multimodal large models in cross-modal semantic alignment |
large language model multimodal |
|
|
| 11 |
HiDrop: Hierarchical Vision Token Reduction in MLLMs via Late Injection, Concave Pyramid Pruning, and Early Exit |
HiDrop: improves the efficiency of multimodal large language models through hierarchical vision token reduction. |
large language model multimodal |
✅ |
|
| 12 |
Steering and Rectifying Latent Representation Manifolds in Frozen Multi-modal LLMs for Video Anomaly Detection |
Proposes SteerVAD, which steers and rectifies latent representation manifolds in frozen multimodal LLMs for video anomaly detection. |
large language model |
|
|
| 13 |
Interpretable Debiasing of Vision-Language Models for Social Fairness |
Proposes DeBiasLens, which removes social bias from vision-language models in an interpretable way |
multimodal |
|
|
| 14 |
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks |
Proposes the Ref-Adv benchmark, which reveals the limitations of MLLM visual reasoning in referring expression comprehension |
multimodal |
|
|
| 15 |
UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking |
UTPTrack: proposes a simple, unified token pruning framework that improves visual tracking efficiency. |
multimodal |
✅ |
|
| 16 |
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model |
Proposes DLEBench, a benchmark for evaluating how well instruction-based image editing models edit small-scale objects. |
instruction following |
|
|