| 1 |
LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model |
Proposes LLaVA-RE, which uses a multimodal large language model for binary image-text relevancy evaluation. |
large language model multimodal |
|
|
| 2 |
PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems |
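PhysPatch: a physically realizable and transferable adversarial patch attack on autonomous driving systems based on multimodal large language models. |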
large language model multimodal |
|
|
| 3 |
Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision |
Proposes Uni-CoT, which unifies chain-of-thought reasoning across text and vision and achieves SOTA performance on multimodal tasks. |
large language model multimodal chain-of-thought |
✅ |
|
| 4 |
AI vs. Human Moderators: A Comparative Evaluation of Multimodal LLMs in Content Moderation for Brand Safety |
Evaluates multimodal LLMs on content moderation for brand safety, comparing AI with human moderators. |
large language model multimodal |
|
|
| 5 |
mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering |
Proposes mKG-RAG, which enhances RAG with a multimodal knowledge graph to improve visual question answering performance. |
large language model multimodal |
|
|
| 6 |
Finding Needles in Images: Can Multimodal LLMs Locate Fine Details? |
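Proposes the NiM benchmark and the Spot-IT method to improve the ability of multimodal large language models to locate fine-grained details in complex documents. |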
large language model multimodal |
|
|
| 7 |
MedPatch: Confidence-Guided Multi-Stage Fusion for Multimodal Clinical Data |
MedPatch: a confidence-guided multi-stage fusion method for multimodal clinical data analysis. |
multimodal |
|
|
| 8 |
AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models |
AdaFusion: a prompt-guided method for adaptively fusing pathology foundation models. |
foundation model |
|
|
| 9 |
A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection |
Proposes a multimodal framework based on context-aware attention and graph neural networks for detecting misogynistic content. |
multimodal |
|
|
| 10 |
Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis |
Proposes Follow-Your-Instruction, a comprehensive MLLM-based agent for automatic world data synthesis. |
large language model multimodal |
|
|
| 11 |
MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs |
MELLA: bridges the gap between linguistic capability and cultural groundedness for low-resource language MLLMs. |
large language model multimodal |
|
|
| 12 |
B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding |
Proposes the B4DL benchmark for spatio-temporal understanding with 4D LiDAR LLMs. |
large language model multimodal |
✅ |
|
| 13 |
Symmetry Understanding of 3D Shapes via Chirality Disentanglement |
Proposes an unsupervised chirality feature extraction method built on the Diff3F framework to disentangle left-right symmetry in 3D shapes. |
foundation model |
✅ |
|
| 14 |
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions |
Proposes FIxLIP, which uses weighted Banzhaf interactions to explain similarity in vision-language encoders and outperforms first-order methods. |
multimodal |
|
|
| 15 |
Segmenting the Complex and Irregular in Two-Phase Flows: A Real-World Empirical Study with SAM2 |
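Uses a fine-tuned SAM2 to segment irregular bubbles in complex gas-liquid two-phase flows, addressing the limitations of traditional methods. |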
foundation model |
|
|
| 16 |
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization |
VFlowOpt: a token pruning framework for large models, guided by visual information flow, that improves inference efficiency. |
multimodal |
|
|
| 17 |
IAD-R1: Reinforcing Consistent Reasoning in Industrial Anomaly Detection |
Proposes the IAD-R1 framework, which strengthens the reasoning consistency of vision-language models in industrial anomaly detection. |
chain-of-thought |
✅ |
|
| 18 |
Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features |
Proposes Surformer v1, which uses a Transformer to fuse tactile and vision features for surface classification. |
multimodal |
|
|