| # | Title | Summary | Tags | ✅ |
|---|---|---|---|---|
| 1 | Towards Interpretable Foundation Models for Retinal Fundus Images | Proposes Dual-IFM, an interpretable foundation model for retinal fundus images. | foundation model | |
| 2 | LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs | LVOmniBench: the first benchmark for evaluating long audio-video understanding in omnimodal LLMs. | large language model, multimodal | |
| 3 | CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models | Proposes the CoDA framework to evaluate and improve the robustness of medical vision-language models in clinical workflows. | large language model, multimodal | |
| 4 | To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs | Proposes a three-tier diagnostic framework that uncovers visual sycophancy and split beliefs in vision-language models. | instruction following, visual grounding | |
| 5 | Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens | Proposes CubiD, the first diffusion model for high-dimensional discrete representations, applied to visual generation. | multimodal | ✅ |
| 6 | Tinted Frames: Question Framing Blinds Vision-Language Models | Shows that vision-language models are sensitive to question framing and proposes a prompt-tuning method to improve visual attention. | visual grounding | |
| 7 | SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues | SAVeS: steering safety judgments in vision-language models via semantic cues. | multimodal | |
| 8 | SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation | SignAgent: agentic LLMs for linguistically grounded sign language annotation and dataset curation. | large language model | |
| 9 | SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation | SwiftTailor: efficient 3D garment generation using a geometry image representation. | multimodal | |
| 10 | SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models | Proposes Sparse Embedding Modulation (SEM) for post-hoc debiasing of vision-language models. | multimodal | |
| 11 | Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token | Proposes SELF1E, decoder-free image segmentation in which the multimodal LLM (MLLM) itself segments using a single segmentation token. | large language model | ✅ |
| 12 | Motion-o: Trajectory-Grounded Video Reasoning | Proposes Motion-o, which strengthens spatio-temporal reasoning in video understanding through explicit trajectory-grounded reasoning. | chain-of-thought | ✅ |
| 13 | Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning | Proposes a dual-model approach that predicts affective engagement and vocal attractiveness from speaker expressiveness in video learning. | multimodal | |
| 14 | Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA | Proposes Click-to-Ask, an AI assistant for live-streaming e-commerce that combines offline copywriting with online interactive QA. | multimodal | |
| 15 | T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World | Proposes the T-QPM framework, strengthening OOD detection and domain generalization for vision-language models in dynamic open-world settings. | multimodal | |
| 16 | Gastric-X: A Multimodal Multi-Phase Benchmark Dataset for Advancing Vision-Language Models in Gastric Cancer Analysis | Gastric-X: a multimodal, multi-phase benchmark dataset for advancing vision-language models in gastric cancer analysis. | multimodal | |
| 17 | Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following | Proposes an instruction-free tuning method that improves medical vision-language models on instruction-following tasks. | instruction following | |
| 18 | ReXInTheWild: A Unified Benchmark for Medical Photograph Understanding | Proposes ReXInTheWild, a unified benchmark for evaluating vision-language models on medical photograph understanding. | large language model, multimodal | |
| 19 | Narrative Aligned Long Form Video Question Answering | Proposes the NA-VQA benchmark and the Video-NaRA framework to tackle narrative reasoning over long-form videos. | large language model, multimodal | |