| # | Title | Summary | Keywords |  |
|---|---|---|---|---|
| 1 | DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance | Proposes DeepDubber-V1, which guides movie dubbing with multi-modal CoT reasoning to improve quality and adapt to different speaking styles. | large language model, multimodal, chain-of-thought |  |
| 2 | FlexiMo: A Flexible Remote Sensing Foundation Model | FlexiMo: a flexible remote sensing foundation model that adapts to arbitrary spatial resolutions. | foundation model, multimodal |  |
| 3 | Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity | Studies how image resolution and visual complexity affect context-independent OCR with multimodal LLMs. | large language model, multimodal |  |
| 4 | Leveraging Diffusion Model and Image Foundation Model for Improved Correspondence Matching in Coronary Angiography | Leverages diffusion models and image foundation models to improve correspondence matching in coronary angiography. | foundation model |  |
| 5 | Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation | Adapts vision foundation models for real-time ultrasound image segmentation. | foundation model |  |
| 6 | PathOrchestra: A Comprehensive Foundation Model for Computational Pathology with Over 100 Diverse Clinical-Grade Tasks | PathOrchestra: a comprehensive foundation model for computational pathology supporting over 100 clinical-grade tasks. | foundation model |  |
| 7 | Can Test-Time Scaling Improve World Foundation Model? | Proposes the SWIFT framework, which improves world foundation model performance by scaling test-time compute. | foundation model | ✅ |
| 8 | FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics | Proposes FakeScope, a large multimodal expert model for transparent AI-generated image forensics. | multimodal |  |
| 9 | MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing | Proposes MB-ORES, a multi-branch object reasoner for visual grounding in remote sensing imagery. | visual grounding | ✅ |
| 10 | Foundation Models For Seismic Data Processing: An Extensive Review | Evaluates the potential of models pre-trained on natural images for seismic data processing. | foundation model |  |
| 11 | AI-Assisted Colonoscopy: Polyp Detection and Segmentation using Foundation Models | Uses foundation models for AI-assisted polyp detection and segmentation in colonoscopy. | foundation model |  |
| 12 | IMPACT: A Generic Semantic Loss for Multimodal Medical Image Registration | IMPACT: a generic semantic loss function for multimodal medical image registration. | multimodal |  |
| 13 | PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis | PolypSegTrack: a unified foundation model for colonoscopy video analysis that detects, segments, classifies, and tracks polyps. | foundation model |  |
| 14 | HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment | Proposes HumanAesExpert to address human image aesthetic assessment. | foundation model | ✅ |
| 15 | STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding? | STI-Bench: evaluates the spatial-temporal understanding capabilities of multimodal large language models. | embodied AI, large language model, multimodal |  |
| 16 | Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation | Any2Caption: a framework for controllable video generation that uses multimodal LLMs to interpret arbitrary conditions into detailed captions. | large language model, multimodal |  |
| 17 | Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs | Chapter-Llama uses LLMs to efficiently segment hour-long videos into chapters and generate chapter titles. | large language model, TAMP |  |
| 18 | H2VU-Benchmark: A Comprehensive Benchmark for Hierarchical Holistic Video Understanding | Proposes the H2VU benchmark for comprehensively evaluating hierarchical holistic video understanding, particularly on long videos and online streaming. | large language model, multimodal |  |
| 19 | COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation | Proposes COSMO, which reduces the computational cost of vision-and-language navigation via selective memorization while improving performance. | VLN |  |
| 20 | Towards Understanding How Knowledge Evolves in Large Vision-Language Models | Reveals the trajectory of knowledge evolution in large vision-language models, offering a new perspective on their internal mechanisms. | multimodal | ✅ |
| 21 | Style Quantization for Data-Efficient GAN Training | Proposes SQ-GAN, which improves GAN training in data-scarce settings via style quantization. | foundation model |  |
| 22 | It's a (Blind) Match! Towards Vision-Language Correspondence without Parallel Data | Proposes a vision-language correspondence method that needs no parallel data, exploring unsupervised matching of model representations. | foundation model |  |
| 23 | Boosting MLLM Reasoning with Text-Debiased Hint-GRPO | Proposes Hint-GRPO, which improves MLLM performance on complex multimodal reasoning tasks via a text-debiased hint mechanism. | multimodal | ✅ |
| 24 | MGD-SAM2: Multi-view Guided Detail-enhanced Segment Anything Model 2 for High-Resolution Class-agnostic Segmentation | MGD-SAM2: a multi-view guided, detail-enhanced SAM2 for high-resolution class-agnostic segmentation. | foundation model | ✅ |
| 25 | Short-video Propagation Influence Rating: A New Real-world Dataset and A New Large Graph Model | Proposes the XS-Video dataset and the NetGPT model for rating the cross-platform propagation influence of short videos. | large language model |  |