| # | Title | Summary | Tags | ✅ |
|---|-------|---------|------|----|
| 1 | FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology | Evaluates the capabilities of multimodal large language models in marine biology and builds a large-scale benchmark dataset. | large language model, multimodal | |
| 2 | MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment | Proposes the MMRQA framework, which fuses signal processing with multimodal large language models to improve MRI quality assessment. | large language model, multimodal | |
| 3 | Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation | Proposes the ViPET-ReportGen dataset and benchmark to advance medical vision-language foundation model research on Vietnamese PET/CT report generation. | foundation model, multimodal | ✅ |
| 4 | EVLF-FM: Explainable Vision Language Foundation Model for Medicine | Proposes EVLF-FM, an explainable medical vision-language foundation model for multi-disease diagnosis and visual question answering. | foundation model, multimodal, visual grounding | |
| 5 | LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models | LLM-RG: referring-expression grounding in outdoor scenarios using large language models. | large language model, chain-of-thought | |
| 6 | GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs | GHOST: a hallucination-inducing image generation method for stress-testing multimodal LLMs. | large language model, multimodal | |
| 7 | Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding | Proposes LayerCD, which mitigates hallucination in multimodal LLMs via layer contrastive decoding. | large language model, multimodal | ✅ |
| 8 | OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding | OIG-Bench: a multi-agent-annotated benchmark for multimodal one-image guide understanding. | large language model, multimodal | ✅ |
| 9 | Vision Function Layer in Multimodal LLMs | Identifies vision function layers in multimodal LLMs, enabling efficient and customizable models. | large language model, multimodal | |
| 10 | Multimodal Arabic Captioning with Interpretable Visual Concept Integration | VLCAP: a multimodal Arabic image-captioning framework with interpretable visual concept integration. | multimodal | |
| 11 | VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning | VideoAnchor: achieves coherent visual-spatial reasoning by reinforcing subspace-structured visual cues. | large language model, multimodal, visual grounding | ✅ |
| 12 | A Scalable Distributed Framework for Multimodal GigaVoxel Image Registration | Proposes the FFDP framework, enabling multimodal image registration at an unprecedented gigavoxel scale. | multimodal | |
| 13 | Robust Multimodal Semantic Segmentation with Balanced Modality Contributions | Proposes EQUISeg, which improves the robustness of multimodal semantic segmentation by balancing modality contributions. | multimodal | |
| 14 | Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models | Proposes the Uni-X architecture, which mitigates modality conflict in unified multimodal models via a two-end-separated structure. | multimodal | ✅ |
| 15 | Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection | Proposes the Forensic-Chat framework, addressing the limited generalizability and explainability of multimodal large language models in fake image detection. | large language model, multimodal | |
| 16 | PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images | PixelCraft: a multi-agent system for high-fidelity visual reasoning on structured images. | large language model, multimodal | ✅ |
| 17 | VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning | Proposes the VT-FSL framework, which bridges vision and text with LLMs to improve few-shot learning performance. | large language model, multimodal | ✅ |
| 18 | Environment-Aware Satellite Image Generation with Diffusion Models | Proposes an environment-aware diffusion model for generating high-quality, environment-consistent satellite images. | foundation model, multimodal | |
| 19 | FreeRet: MLLMs as Training-Free Retrievers | Proposes the FreeRet framework for training-free multimodal retrieval. | large language model, multimodal | |
| 20 | Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks | Proposes the Euclid30K dataset and fine-tunes vision-language models on it, significantly improving their spatial perception and reasoning. | large language model, multimodal | ✅ |
| 21 | UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark | Proposes UI2V-Bench to address semantic understanding in image-to-video generation. | large language model, multimodal | |
| 22 | VISOR++: Universal Visual Inputs based Steering for Large Vision Language Models | VISOR++: a behavior-steering method for large vision-language models based on universal visual inputs. | multimodal | |
| 23 | Perceive, Reflect and Understand Long Video: Progressive Multi-Granular Clue Exploration with Interactive Agents | CogniGPT: an interactive multi-granular clue-exploration framework for efficient long-video understanding. | large language model | |
| 24 | Training-Free Token Pruning via Zeroth-Order Gradient Estimation in Vision-Language Models | Proposes a training-free token-pruning method to reduce the inference cost of vision-language models. | multimodal | |
| 25 | Instruction Guided Multi Object Image Editing with Quantity and Layout Consistency | Proposes QL-Adapter, addressing quantity and layout consistency in multi-object image editing. | instruction following | |