| 1 | EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning | Proposes the EDVD-LLaMA framework, which uses multimodal large language model reasoning for explainable deepfake video detection. | large language model, multimodal, chain-of-thought | ✅ | |
| 2 | VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs | VisionSelector: end-to-end learnable visual token compression that improves the efficiency of multimodal LLMs. | large language model, multimodal | ✅ | |
| 3 | Universal and Transferable Attacks on Pathology Foundation Models | Proposes UTAP, a universal and transferable adversarial perturbation, revealing the vulnerability of pathology foundation models. | foundation model | | |
| 4 | PRISMM-Bench: A Benchmark of Peer-Review Grounded Multimodal Inconsistencies | Proposes PRISMM-Bench to address inconsistencies in multimodal scientific papers. | multimodal | | |
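| 5 | Structured Interfaces for Automated Reasoning with 3D Scene Graphs | Proposes a structured-interface approach to reasoning over 3D scene graphs, improving LLM performance on natural language understanding for robotics. | large language model, instruction following | | |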
| 6 | NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation | NavQ: learns a Q-model for foresighted vision-and-language navigation. | VLN | | |
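| 7 | VIPAMIN: Visual Prompt Initialization via Embedding Selection and Subspace Expansion | VIPAMIN: initializes visual prompts via embedding selection and subspace expansion, improving the downstream-task performance of self-supervised models. | foundation model | ✅ | |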
| 8 | iWatchRoadv2: Pothole Detection, Geospatial Mapping, and Intelligent Road Governance | iWatchRoadv2: proposes a YOLO-based platform for real-time pothole detection, geospatial mapping, and intelligent road governance. | TAMP | | |
| 9 | Cerberus: Real-Time Video Anomaly Detection via Cascaded Vision-Language Models | Cerberus: a real-time video anomaly detection system based on cascaded vision-language models. | visual grounding | | |