| 1 |
Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension |
Proposes the Chain-of-Caption framework, which improves multimodal large language models on referring expression comprehension without any training. |
large language model multimodal chain-of-thought |
|
|
| 2 |
From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models |
Proposes the HATCH framework to make multi-view spatial reasoning in multimodal large language models more human-like. |
large language model multimodal |
|
|
| 3 |
TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models |
Proposes the TiFRe framework, which improves Video MLLM efficiency through text-guided video frame reduction. |
large language model |
|
|
| 4 |
Multimodal Learning for Arcing Detection in Pantograph-Catenary Systems |
Proposes MultiDeepSAD, a multimodal learning framework for detecting arcing faults in pantograph-catenary systems. |
multimodal |
|
|
| 5 |
Zero-shot System for Automatic Body Region Detection for Volumetric CT and MR Images |
Proposes a zero-shot method that uses pretrained models to automatically detect body regions in CT/MR images. |
large language model foundation model multimodal |
|
|
| 6 |
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence |
OneVision-Encoder: codec-aligned sparsity as a foundational principle for multimodal intelligence. |
multimodal |
|
|
| 7 |
GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving |
GeoFocus: a multimodal geometry problem-solving framework that blends efficient global-to-local perception. |
multimodal |
✅ |
|
| 8 |
Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation? |
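Studies how well vision foundation models apply to electron microscopy image segmentation and reveals the difficulty of cross-dataset generalization. |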
foundation model |
|
|
| 9 |
A Unified Framework for Multimodal Image Reconstruction and Synthesis using Denoising Diffusion Models |
Any2all: a unified framework for multimodal image reconstruction and synthesis based on denoising diffusion models. |
multimodal |
|
|
| 10 |
Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries |
Proposes the Vista framework to address scene awareness in streaming video question answering. |
large language model multimodal |
|
|
| 11 |
Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing |
Omni-Video 2: scales MLLM-conditioned diffusion models to unify video generation and editing. |
multimodal |
|
|
| 12 |
MOVA: Towards Scalable and Synchronized Video-Audio Generation |
MOVA: a model for scalable and synchronized video-audio generation. |
multimodal |
|
|
| 13 |
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions |
Proposes Omni Dense Captioning to generate time-aware captions for multi-scene videos. |
TAMP |
✅ |
|
| 14 |
ALIVE: Animate Your World with Lifelike Audio-Video Generation |
ALIVE: brings the world to life through lifelike audio-video generation. |
foundation model |
✅ |
|
| 15 |
Improving Reconstruction of Representation Autoencoder |
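Proposes LV-RAE, which improves the image reconstruction and generation quality of representation autoencoders by enhancing low-level information and optimizing the decoder. |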
foundation model |
✅ |
|