| 1 |
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models |
Proposes VisNumBench, a benchmark for evaluating the number sense of multimodal large language models (MLLMs). |
large language model multimodal chain-of-thought |
✅ |
|
| 2 |
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation |
Proposes UPME, an unsupervised evaluation framework for multimodal large language models that reduces reliance on human annotation. |
large language model multimodal |
|
|
| 3 |
Visual Position Prompt for MLLM based Visual Grounding |
VPP-LLaVA: enhances the visual grounding ability of MLLMs through visual position prompts |
large language model multimodal visual grounding |
✅ |
|
| 4 |
Benchmarking Large Language Models for Handwritten Text Recognition |
Evaluates the performance of large language models on handwritten text recognition and explores their zero-shot transfer ability |
large language model multimodal |
|
|
| 5 |
EarthScape: A Multimodal Dataset for Surficial Geologic Mapping and Earth Surface Analysis |
EarthScape: a multimodal dataset for surficial geologic mapping and Earth surface analysis |
multimodal |
✅ |
|
| 6 |
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning |
LLaVA-MORE: a comparative study of LLMs and visual backbones in multimodal large language models, aimed at improving visual instruction tuning |
large language model multimodal instruction following |
✅ |
|
| 7 |
Visual Persona: Foundation Model for Full-Body Human Customization |
Visual Persona: a foundation model for full-body human customization |
foundation model |
|
|
| 8 |
EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds |
EdgeRegNet: an edge-feature-based multimodal registration network between images and LiDAR point clouds |
multimodal |
|
|
| 9 |
Generating Multimodal Driving Scenes via Next-Scene Prediction |
Proposes UMGen, which generates multimodal autonomous driving scenes by predicting the next scene and supports a map modality. |
multimodal |
✅ |
|
| 10 |
Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation |
Proposes FakeVLM: synthetic image detection with forgery explanation, based on a large multimodal model |
multimodal |
✅ |
|
| 11 |
Cube: A Roblox View of 3D Intelligence |
Proposes Cube: a foundation model for 3D intelligence from Roblox's perspective, enabling 3D content generation and understanding |
large language model foundation model |
|
|
| 12 |
EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models |
EfficientLLaVA: a generalizable auto-pruning method for large vision-language models |
large language model multimodal |
|
|
| 13 |
TruthLens:A Training-Free Paradigm for DeepFake Detection |
Proposes TruthLens, a training-free deepfake detection framework that improves explainability. |
large language model multimodal |
|
|
| 14 |
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems |
MathFlow: improves the perception ability of MLLMs on visual mathematical problems |
large language model multimodal |
✅ |
|
| 15 |
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding |
FAVOR-Bench: a comprehensive benchmark for fine-grained video motion understanding |
large language model multimodal |
|
|
| 16 |
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations |
Proposes the Multi-Frequency Perturbations (MFP) method to mitigate object hallucinations in multimodal large language models |
large language model multimodal |
|
|
| 17 |
Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection |
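Proposes a knowledge-guided forgery detection framework that improves the generalization and explainability of large vision-language models for deepfake detection |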
large language model multimodal |
|
|
| 18 |
Vision-Speech Models: Teaching Speech Models to Converse about Images |
Proposes MoshiVis, which gives a speech model visual understanding and enables spoken dialogue about images |
multimodal |
|
|
| 19 |
Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models |
Proposes Forensics-Bench, a benchmark for comprehensively evaluating the forgery detection capabilities of large vision-language models. |
multimodal |
✅ |
|
| 20 |
Universal Scene Graph Generation |
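Proposes the Universal Scene Graph (USG) representation and a parser for it, enabling comprehensive understanding of multimodal scene semantics. |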
multimodal |
|
|