| 1 |
Multimodal Language Models See Better When They Look Shallower |
Proposes a visual layer selection strategy to improve the performance of multimodal large language models |
large language model multimodal |
|
|
| 2 |
UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation |
UniBiomed: a universal foundation model for interpretable biomedical image analysis |
large language model foundation model |
|
|
| 3 |
CMD: Constraining Multimodal Distribution for Domain Adaptation in Stereo Matching |
Proposes CMD, a method that constrains the multimodal (multi-peaked) distribution in domain adaptation for stereo matching |
multimodal |
✅ |
|
| 4 |
GarmentDiffusion: 3D Garment Sewing Pattern Generation with Multimodal Diffusion Transformers |
GarmentDiffusion: 3D garment sewing pattern generation based on multimodal diffusion Transformers |
multimodal |
✅ |
|
| 5 |
Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs |
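Proposes the HEAL-MedVQA benchmark and the LobA framework to improve the localization ability and hallucination resistance of medical multimodal LLMs. |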
multimodal |
|
|
| 6 |
A Survey of Interactive Generative Video |
Surveys interactive generative video technology, proposes a general framework with five major modules, and analyzes future directions. |
embodied AI multimodal |
|
|
| 7 |
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning |
Proposes COMPACT, which composes atomic visual capabilities for efficient fine-tuning of multimodal large models. |
large language model multimodal |
|
|
| 8 |
Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields |
Uses GPT-4o's generation capability to explore AIGC for ultra-low-bitrate image compression, achieving excellent performance. |
foundation model multimodal |
|
|
| 9 |
Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision |
Diff-Prompt: uses a diffusion model and mask supervision to generate fine-grained prompts, improving multimodal model performance on complex tasks. |
foundation model multimodal |
✅ |
|
| 10 |
Simple Visual Artifact Detection in Sora-Generated Videos |
Proposes a multi-label classification framework for detecting visual artifacts in Sora-generated videos. |
large language model multimodal |
|
|
| 11 |
Zoomer: Adaptive Image Focus Optimization for Black-box MLLM |
Zoomer: an adaptive image focus optimization framework for black-box MLLMs that improves small-object recognition. |
large language model multimodal |
|
|
| 12 |
DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation |
Proposes the DOPE network to enhance the agent's object perception in vision-and-language navigation |
VLN |
|
|
| 13 |
SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding |
Proposes SeriesBench to evaluate multimodal large language models on narrative-driven drama series understanding. |
large language model |
✅ |
|
| 14 |
Responsive DNN Adaptation for Video Analytics against Environment Shift via Hierarchical Mobile-Cloud Collaborations |
MOCHA: a responsive DNN adaptation framework for video analytics under environment shift, based on hierarchical mobile-cloud collaboration |
foundation model |
|
|
| 15 |
Nexus-Gen: Unified Image Understanding, Generation, and Editing via Prefilled Autoregression in Shared Embedding Space |
Nexus-Gen: unified image understanding, generation, and editing via prefilled autoregression in a shared embedding space |
multimodal |
✅ |
|
| 16 |
CoCoDiff: Diversifying Skeleton Action Features via Coarse-Fine Text-Co-Guided Latent Diffusion |
CoCoDiff: improves feature diversity for skeleton action recognition via a latent diffusion model co-guided by coarse- and fine-grained text. |
large language model |
|
|