| # | Title | Summary | Keywords | ✅ |
|---|-------|---------|----------|----|
| 1 | Vector-Quantized Vision Foundation Models for Object-Centric Learning | Proposes VQ-VFM-OCL, which improves object-centric learning by sharing quantized vision foundation model representations. | foundation model | ✅ |
| 2 | Do computer vision foundation models learn the low-level characteristics of the human visual system? | Evaluates how closely computer vision foundation models match the human visual system on low-level characteristics. | foundation model | |
| 3 | Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think | Proposes Dream Engine, a unified image-generation framework with text-image interleaved control. | multimodal | |
| 4 | Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion | Proposes a boosting-based multimodal learning method that mitigates the disproportion in classification ability across modalities. | multimodal | ✅ |
| 5 | Joint Fusion and Encoding: Advancing Multimodal Retrieval from the Ground Up | Proposes a joint fusion-and-encoding framework that strengthens multimodal retrieval from the ground up. | multimodal | |
| 6 | Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios | Introduces the CVQA and CPVQA benchmarks, revealing the limitations of large language models in compositional reasoning over complex scenarios. | large language model | |
| 7 | C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation | Proposes C-Drag, a chain-of-thought driven motion controller for finer-grained controllable video generation. | chain-of-thought | ✅ |
| 8 | Reliable Multimodal Learning Via Multi-Level Adaptive DeConfusion | Proposes a multi-level adaptive deconfusion method that improves the reliability of multimodal learning under noise. | multimodal | |
| 9 | Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection | Fine-tunes GPT-4o for traffic conflict detection at urban intersections, improving its visual reasoning ability. | large language model, multimodal | ✅ |
| 10 | CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding | CoCa-CXR: contrastive captioning models learn temporal structures for chest X-ray vision-language understanding. | large language model, foundation model | |
| 11 | VideoA11y: Method and Dataset for Accessible Video Description | VideoA11y: a method and dataset that use multimodal large language models to generate accessible video descriptions. | large language model, multimodal | ✅ |
| 12 | New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration | Proposes a method and dataset for fine-grained compositional referring expression comprehension via specialist-model and MLLM collaboration. | large language model, multimodal | ✅ |
| 13 | AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs | Proposes AsymLoRA, which harmonizes data conflicts and commonalities in MLLMs via asymmetric LoRA, improving multimodal task performance. | large language model, multimodal | ✅ |
| 14 | Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack | Proposes the Dynamic Vision-Language Alignment attack (DynVLA) to improve the transferability of adversarial attacks on MLLMs. | large language model, multimodal | |
| 15 | Interpreting CLIP with Hierarchical Sparse Autoencoders | Proposes Matryoshka SAE for interpretability analysis and control of the CLIP model. | multimodal | ✅ |
| 16 | Visual Adaptive Prompting for Compositional Zero-Shot Learning | Proposes VAPS, a visual adaptive prompting system that addresses the underuse of visual information in compositional zero-shot learning. | multimodal | |
| 17 | Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars | Avat3r: a large Gaussian-reconstruction-based model for animatable 3D head avatars, requiring only a few input images. | foundation model | ✅ |
| 18 | ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning | ReCon: enhances true-correspondence discrimination through relation consistency, enabling robust noisy correspondence learning. | multimodal | ✅ |
| 19 | One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion | Proposes GIFNet, which leverages low-level vision task interaction to achieve task-agnostic image fusion. | multimodal | ✅ |
| 20 | ProAPO: Progressively Automatic Prompt Optimization for Visual Classification | Proposes ProAPO to address prompt optimization for visual classification. | large language model | |