| 1 |
Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models |
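Surveys research on multimodal adaptation and generalization, from traditional approaches to multimodal pretrained foundation models |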
foundation model, multimodal |
✅ |
|
| 2 |
Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models |
Proposes the MIS dataset to improve the visual reasoning of vision language models in safety scenarios |
multimodal, instruction following, chain-of-thought |
|
|
| 3 |
High-Accuracy ECG Image Interpretation using Parameter-Efficient LoRA Fine-Tuning with Multimodal LLaMA 3.2 |
Fine-tunes the multimodal LLaMA 3.2 model with parameter-efficient LoRA to achieve high-accuracy ECG image interpretation |
multimodal |
|
|
| 4 |
Tuning Vision Foundation Model via Test-Time Prompt-Guided Training for VFSS Segmentations |
Proposes a test-time prompt-guided training method to improve vision foundation model performance on VFSS segmentation |
foundation model |
|
|
| 5 |
AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment |
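Proposes AGAV-Rater, which uses a large multimodal model to assess the quality of AI-generated audio-visual content and improve user experience |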
multimodal |
✅ |
|
| 6 |
Efficient Audiovisual Speech Processing via MUTUD: Multimodal Training and Unimodal Deployment |
Proposes the MUTUD framework for efficient speech processing through multimodal training and unimodal deployment |
multimodal |
|
|
| 7 |
Every Image Listens, Every Image Dances: Music-Driven Image Animation |
MuseDance: proposes a music-driven image animation model that needs no complex motion guidance |
multimodal |
|
|
| 8 |
Multispectral 3D mapping on a Roman sculpture to study ancient polychromy |
Proposes a color analysis method for Roman sculpture based on multispectral 3D modeling, used to study ancient polychromy |
multimodal |
|
|
| 9 |
Human Re-ID Meets LVLMs: What can we expect? |
Evaluates the performance and limitations of large vision-language models on person re-identification |
multimodal |
|
|
| 10 |
Foundational Models for 3D Point Clouds: A Survey and Outlook |
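Surveys foundation models for 3D point clouds, filling the gap left by the lack of a comprehensive, in-depth literature review in the field |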
foundation model |
✅ |
|
| 11 |
CLEAR: Cue Learning using Evolution for Accurate Recognition Applied to Sustainability Data Extraction |
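Proposes the CLEAR framework, which uses evolutionary algorithms to optimize prompts and improve LLM image recognition accuracy in sustainability data extraction |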
large language model |
|
|
| 12 |
A Video-grounded Dialogue Dataset and Metric for Event-driven Activities |
Proposes the VDAct video dialogue dataset and the VDEval evaluation metric for understanding event-driven activities |
foundation model |
|
|