| # | Title | Summary | Keywords | |
|---|-------|---------|----------|---|
| 1 | VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models | A comprehensive benchmark for evaluating visual-prompt understanding in multimodal large language models | large language model, multimodal | |
| 2 | MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model | Introduces MicroVQA++, a high-quality microscopy reasoning dataset that uses weakly supervised graphs for multimodal large language model training | large language model, multimodal | |
| 3 | Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models | Introduces QTSplus to address visual-information selection in long-video understanding | large language model, multimodal | |
| 4 | Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models | Evaluates the document image quality assessment capabilities of multimodal large language models | large language model, chain-of-thought | ✅ |
| 5 | AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models | An adversarial unlearning framework for visual concepts in multimodal large language models | large language model, multimodal | |
| 6 | MAFM^3: Modular Adaptation of Foundation Models for Multi-Modal Medical AI | A modular adaptation framework of foundation models for multimodal medical AI | foundation model, multimodal | ✅ |
| 7 | CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging | A multimodal cross-task benchmark for evaluating compositional generalization in medical imaging | large language model, multimodal | |
| 8 | Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from H&E Images | Introduces nnUNet-B, a multimodal posterior-sampling approach for PD-L1 segmentation and uncertainty estimation from H&E images | multimodal | |
| 9 | ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation | Introduces ImAgent, a unified multimodal agent framework for test-time scalable image generation | multimodal | |
| 10 | Synergy vs. Noise: Performance-Guided Multimodal Fusion For Biochemical Recurrence-Free Survival in Prostate Cancer | Proposes a performance-guided multimodal fusion method that improves prediction of biochemical recurrence in prostate cancer | multimodal | |
| 11 | The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models | Proposes a multimodal iconicity evaluation framework for analyzing the persistence of cultural memory in diffusion models | multimodal | |
| 12 | DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding | Introduces DocSLM, a small vision-language model for long-document understanding on resource-constrained edge devices | multimodal | ✅ |
| 13 | Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End? | Reveals positional bias in multimodal embedding models: text embeddings favor the beginning, image embeddings favor both ends | multimodal | |
| 14 | Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions | Surveys the limitations and challenges of detecting AI-generated media and outlines research directions toward multimodal deep-learning solutions | multimodal | |
| 15 | EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation | The first multimodal emotion video dataset, for emotion-centric video understanding and generation | multimodal | |
| 16 | Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models | Proposes positive and negative prompt supervision to improve OOD detection performance | large language model | |
| 17 | PhaseWin Search Framework Enable Efficient Object-Level Interpretation | An efficient object-level interpretation framework that achieves faithful region attribution with near-linear complexity | foundation model, multimodal, visual grounding | |
| 18 | AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization | Adaptive-focusing and cross-calibration KV cache optimization for efficient audio-video LLM inference | large language model, multimodal | |
| 19 | S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation | Introduces S2D-Align, which uses shallow-to-deep auxiliary learning for anatomically grounded radiology report generation | large language model, multimodal | |
| 20 | Draft and Refine with Visual Experts | Proposes a Draft and Refine framework that improves LVLMs' use of visual information and reduces hallucination | multimodal, visual grounding | ✅ |
| 21 | Φeat: Physically-Grounded Feature Representation | Introduces Φeat, a physically grounded visual feature representation that improves material recognition | foundation model | |
| 22 | Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression | GEODE decouples 3D reasoning from numerical regression to improve the spatial intelligence of vision-language models | chain-of-thought | |
| 23 | Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model | Introduces LVGM, a large vectorized glyph model based on stroke modeling, for vectorized character generation | large language model | |
| 24 | PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs | A training-free temporal-encoding stabilizer for video LLMs that addresses temporal inconsistency | multimodal | |