| 1 | UniT: Unified Multimodal Chain-of-Thought Test-time Scaling | Proposes UniT, which improves the reasoning ability of unified models through multimodal chain-of-thought test-time scaling. | multimodal chain-of-thought | | |
| 2 | Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding | Proposes object-aligned visual contrastive decoding to mitigate object hallucinations in multimodal large language models. | large language model multimodal | | |
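| 3 | Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation | Proposes the Spatial Chain-of-Thought (SCoT) framework to improve the performance of diffusion models on spatial reasoning generation tasks. | large language model multimodal chain-of-thought | | |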
| 4 | ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning | Proposes ScalSelect to address the efficiency of large-scale multimodal data selection. | multimodal | ✅ | |
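| 5 | A Large Language Model for Disaster Structural Reconnaissance Summarization | Proposes an LLM-based framework for rapid post-disaster structural reconnaissance summarization, improving the efficiency of post-disaster reconstruction. | large language model | | |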
| 6 | Adapting Vision-Language Models for E-commerce Understanding at Scale | Proposes an effective method for adapting vision-language models to e-commerce scenarios at scale. | multimodal instruction following | | |
| 7 | LLM-Driven 3D Scene Generation of Agricultural Simulation Environments | Proposes an LLM-based modular pipeline for generating 3D scenes of agricultural simulation environments. | large language model | | |
| 8 | DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation | Presents DreamID-Omni, a unified framework for controllable human-centric audio-video generation. | foundation model | | |
| 9 | U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction | Proposes TD-FusionUNet, which uses transform-domain fusion for lightweight next-day wildfire spread prediction. | multimodal | | |
| 10 | Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis | Proposes VasoMIM, a vascular anatomy-aware self-supervised pre-training framework that improves X-ray angiogram analysis. | foundation model | ✅ | |