| 1 |
Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation |
Proposes Simignore to improve the complex reasoning ability of multimodal large language models |
large language model multimodal chain-of-thought |
✅ |
|
| 2 |
EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing |
EVLM: Cross-dimensional visual editing via self-reflective multimodal reasoning |
multimodal chain-of-thought |
|
|
| 3 |
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding |
DeepSeek-VL2: Mixture-of-Experts vision-language models for advanced multimodal understanding |
multimodal visual grounding |
✅ |
|
| 4 |
DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts |
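DEFAME: Proposes a fact-checking framework based on dynamic evidence and multimodal experts, significantly improving verification performance in mixed text-image settings. |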
multimodal |
|
|
| 5 |
Apollo: An Exploration of Video Understanding in Large Multimodal Models |
Apollo: Explores video understanding in large multimodal models and proposes efficient training strategies. |
multimodal |
|
|
| 6 |
Robust image classification with multi-modal large language models |
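Proposes MultiShield, which uses multimodal large language models to improve the robustness of image classification models against adversarial attacks. |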
large language model |
|
|
| 7 |
CognitionCapturer: Decoding Visual Stimuli From Human EEG Signal With Multimodal Information |
CognitionCapturer: Decodes visual stimuli from human EEG signals using multimodal information |
multimodal |
✅ |
|
| 8 |
Learning Complex Non-Rigid Image Edits from Multimodal Conditioning |
Proposes an image editing method based on multimodal conditioning that enables person insertion and pose editing. |
multimodal |
|
|
| 9 |
A multimodal dataset for understanding the impact of mobile phones on remote online virtual education |
IMPROVE: A multimodal dataset for understanding the impact of mobile phone use on remote online education |
multimodal |
|
|
| 10 |
B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens |
Proposes B-VLLM to address the visual token count problem in long-video understanding |
large language model |
✅ |
|
| 11 |
All-in-One: Transferring Vision Foundation Models into Stereo Matching |
AIO-Stereo: Transfers vision foundation models to stereo matching, achieving a performance breakthrough |
foundation model |
|
|
| 12 |
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining |
Iris: A visual agent that breaks GUI complexity through adaptive focus and self-refining |
large language model multimodal |
|
|
| 13 |
BrushEdit: All-In-One Image Inpainting and Editing |
Proposes BrushEdit, an inpainting-based interactive framework for instruction-guided image editing |
large language model multimodal |
|
|
| 14 |
Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics |
Proposes UniSim-Bench, a benchmark for multimodal perceptual metrics, and explores unified multimodal perceptual models. |
multimodal |
✅ |
|
| 15 |
Single-Pass Object-Focused Data Selection |
Proposes an object-focused data selection method to optimize the annotation budget |
foundation model |
|
|
| 16 |
Dynamic Cross-Modal Alignment for Robust Semantic Location Prediction |
Proposes the CoVLA framework to resolve ambiguity and discrepancy in semantic location prediction on multimodal social media |
multimodal |
|
|
| 17 |
CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection |
CP-DETR: Stronger universal object detection by guiding DETR with concept prompts |
foundation model |
|
|