| 1 |
Towards Zero-Shot Differential Morphing Attack Detection with Multimodal Large Language Models |
Uses multimodal large language models for zero-shot differential morphing attack detection |
large language model multimodal chain-of-thought |
|
|
| 2 |
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models |
LENS: multi-level evaluation of the multimodal reasoning ability of large language models |
large language model multimodal |
✅ |
|
| 3 |
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought |
Reveals how visual thoughts operate in multimodal chain-of-thought, improving the reasoning ability of LVLMs. |
multimodal chain-of-thought |
|
|
| 4 |
CP-LLM: Context and Pixel Aware Large Language Model for Video Quality Assessment |
CP-LLM: a context- and pixel-aware large language model for video quality assessment |
large language model multimodal |
|
|
| 5 |
Exploring The Visual Feature Space for Multimodal Neural Decoding |
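Proposes a zero-shot neural decoding method based on multimodal large language models that makes better use of the visual feature space. |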
large language model multimodal |
✅ |
|
| 6 |
The P$^3$ dataset: Pixels, Points and Polygons for Multimodal Building Vectorization |
Proposes the P³ dataset for multimodal building vectorization, combining pixel, point cloud, and polygon information |
multimodal |
✅ |
|
| 7 |
Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection |
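Proposes InfoFD, a multimodal conditional information bottleneck network, to improve the generalization of AI-generated image detection |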
multimodal |
✅ |
|
| 8 |
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding |
Proposes disease-aware prompting (DAP) to improve the accuracy of weakly-supervised medical visual grounding. |
visual grounding |
|
|
| 9 |
Make LVLMs Focus: Context-Aware Attention Modulation for Better Multimodal In-Context Learning |
Proposes CAMA: context-aware attention modulation to strengthen the multimodal in-context learning of LVLMs |
multimodal |
|
|
| 10 |
Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts |
Proposes the Pixels Versus Priors method, which controls knowledge priors in vision-language models through visual counterfacts. |
large language model multimodal |
|
|
| 11 |
Analyzing Hierarchical Structure in Vision Models with Sparse Autoencoders |
Uses sparse autoencoders to analyze hierarchical structure in vision models, revealing how the ImageNet hierarchy is encoded. |
large language model foundation model |
|
|
| 12 |
Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval |
Proposes promptable image embeddings for attribute-focused image retrieval. |
large language model multimodal |
|
|
| 13 |
Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts |
Proposes PromptMargin, which uses a multimodal margin regularizer to improve few-shot learning of vision-language models under distribution shifts |
foundation model multimodal |
|
|
| 14 |
Blind Spot Navigation: Evolutionary Discovery of Sensitive Semantic Concepts for LVLMs |
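Proposes a blind spot navigation method based on semantic evolution to discover the sensitivity of LVLMs to specific semantic concepts |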
large language model multimodal |
|
|
| 15 |
From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation |
Surveys the applications and advances of deep learning in remote sensing image semantic segmentation |
foundation model multimodal |
|
|
| 16 |
CineTechBench: A Benchmark for Cinematographic Technique Understanding and Generation |
CineTechBench: a new benchmark for cinematographic technique understanding and generation |
large language model multimodal |
✅ |
|
| 17 |
Streamline Without Sacrifice -- Squeeze out Computation Redundancy in LMM |
ProxyV: reduces computation redundancy in LMMs through proxy vision tokens, improving efficiency |
multimodal |
✅ |
|
| 18 |
Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation |
Proposes a pseudo gloss generation framework that enables sign language translation without manual annotation. |
large language model |
|
|
| 19 |
SCENIR: Visual Semantic Clarity through Unsupervised Scene Graph Retrieval |
SCENIR: proposes a method for visual semantic clarity based on unsupervised scene graph retrieval |
multimodal |
|
|
| 20 |
How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads |
Reveals the role of OCR heads in LVLMs: analyzes how they recognize text in images |
chain-of-thought |
|
|