| 1 |
Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression |
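Proposes the RvTC framework, which incorporates data-specific prompts to improve the performance of multimodal large language models on image regression tasks |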
large language model multimodal |
|
|
| 2 |
Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG |
Med-GRIM: enhances zero-shot medical VQA using prompt-embedded multimodal graph RAG |
large language model multimodal |
✅ |
|
| 3 |
Light Future: Multimodal Action Frame Prediction via InstructPix2Pix |
Proposes a lightweight multimodal action frame prediction method based on InstructPix2Pix for robotic tasks. |
multimodal |
|
|
| 4 |
TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP |
TriCLIP-3D: a unified, parameter-efficient tri-modal 3D visual grounding framework based on CLIP |
visual grounding |
|
|
| 5 |
BleedOrigin: Dynamic Bleeding Source Localization in Endoscopic Submucosal Dissection via Dual-Stage Detection and Tracking |
BleedOrigin-Net: a dual-stage detection and tracking framework for dynamic bleeding source localization in endoscopic submucosal dissection |
large language model multimodal |
|
|
| 6 |
LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering |
LeAdQA: uses LLM-driven context-aware temporal grounding to tackle video question answering |
multimodal visual grounding |
|
|
| 7 |
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding |
Proposes Video-TT: a holistic benchmark for evaluating the advanced reasoning and understanding capabilities of video LLMs |
large language model |
|
|
| 8 |
Grounding Degradations in Natural Language for All-In-One Video Restoration |
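Proposes an end-to-end video restoration framework guided by natural-language semantics that requires no prior knowledge of the degradation type. |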
foundation model |
|
|