| 1 |
Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models |
Proposes I-FAS, which uses multimodal large language models to improve the generalization and interpretability of face anti-spoofing |
large language model, multimodal |
|
|
| 2 |
Multimodal classification of forest biodiversity potential from 2D orthophotos and 3D airborne laser scanning point clouds |
Proposes a deep-learning-based multimodal fusion method that uses orthophotos and LiDAR data to assess forest biodiversity potential. |
multimodal |
|
|
| 3 |
Google is all you need: Semi-Supervised Transfer Learning Strategy For Light Multimodal Multi-Task Classification Model |
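Proposes a semi-supervised transfer learning strategy for lightweight multimodal multi-task classification models, improving image labeling accuracy. |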
multimodal |
|
|
| 4 |
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction |
VITA-1.5: a multimodal large model for real-time vision and speech interaction aimed at GPT-4o level |
large language model, multimodal |
✅ |
|
| 5 |
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM |
Virgo: fine-tunes an MLLM on textual long-form thought data to explore multimodal slow-thinking reasoning |
large language model, multimodal |
✅ |
|
| 6 |
HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding |
Builds HLV-1K, a large-scale hour-long video benchmark, to advance research on time-aware long video understanding. |
large language model, multimodal |
|
|
| 7 |
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs |
AVTrustBench: evaluates and improves the reliability and robustness of audio-visual large language models |
large language model |
|
|
| 8 |
MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation |
Proposes the MoEE model and the DH-FaceEmoVid-150 dataset for generating audio-driven portrait animation with complex emotions. |
multimodal |
|
|
| 9 |
LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction |
LogicAD: explainable anomaly detection based on VLM text feature extraction |
multimodal |
|
|