| 1 |
Solving Zero-Shot 3D Visual Grounding as Constraint Satisfaction Problems |
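Proposes a zero-shot 3D visual grounding method formulated as a constraint satisfaction problem, improving understanding of complex scenes. |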
large language model visual grounding |
✅ |
|
| 2 |
Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts |
Panther: uses instruction-guided visual prompts to enhance the visual perception of multimodal LLMs |
large language model multimodal |
|
|
| 3 |
A Multimodal Approach to The Detection and Classification of Skin Diseases |
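Proposes a multimodal method for detecting and classifying skin diseases that combines image and text information to improve diagnostic accuracy. |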
large language model multimodal |
|
|
| 4 |
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI |
Proposes GMAI-VL, a general medical vision-language model built on a large-scale multimodal medical dataset |
multimodal |
|
|
| 5 |
Multimodal 3D Brain Tumor Segmentation with Adversarial Training and Conditional Random Field |
Proposes a multimodal 3D brain tumor segmentation method based on adversarial training and a conditional random field |
multimodal |
|
|
| 6 |
Multimodal Autoregressive Pre-training of Large Vision Encoders |
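Proposes AIMV2, a large-scale vision encoder trained with multimodal autoregressive pre-training, which significantly improves downstream task performance. |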
multimodal |
|
|
| 7 |
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance |
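Proposes the LACING framework, which reduces language bias in large vision-language models through multimodal dual-attention and soft-image guidance. |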
multimodal |
|
|
| 8 |
SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning |
SMoLoRA: explores and addresses dual catastrophic forgetting in continual visual instruction tuning |
large language model multimodal instruction following |
✅ |
|
| 9 |
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding |
Proposes DYTO, a dynamic token merging framework for zero-shot video understanding. |
large language model multimodal |
|
|
| 10 |
FoPru: Focal Pruning for Efficient Large Vision-Language Models |
Proposes FoPru, an attention-based focal pruning method that improves the efficiency of large vision-language models |
large language model multimodal |
|
|
| 11 |
LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval |
Proposes LLaVA-MR, which uses multimodal large language models to tackle video moment retrieval. |
large language model multimodal |
|
|
| 12 |
FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression |
FocusLLaVA: a coarse-to-fine visual token compression method that improves the efficiency and performance of multimodal large models |
large language model |
|
|
| 13 |
Quantization without Tears |
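Proposes QwT, which achieves efficient, general, and high-accuracy network quantization through a lightweight linear-layer structure. |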
multimodal |
✅ |
|