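| 1 | Dynamic Pyramid Network for Efficient Multimodal Large Language Model | Proposes the Dynamic Pyramid Network (DPN) for efficient multimodal large language models, improving performance while reducing computational cost. | large language model, multimodal | ✅ |  |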
| 2 | Unified Multimodal Discrete Diffusion | Proposes UniDisc, a unified multimodal discrete diffusion model for joint text-image generation and understanding. | multimodal |  |  |
| 3 | Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs | Reveals the "math blind" phenomenon of MLLMs in diagram understanding and proposes an improvement based on graph structure. | large language model, multimodal, chain-of-thought |  |  |
| 4 | MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion | Proposes MMMORRF, which improves multimodal video retrieval through modality-aware weighted reciprocal rank fusion. | multimodal |  |  |
| 5 | TerraTorch: The Geospatial Foundation Models Toolkit | TerraTorch: a toolkit for fine-tuning and benchmarking geospatial foundation models. | foundation model | ✅ |  |
| 6 | CryoSAMU: Enhancing 3D Cryo-EM Density Maps of Protein Structures at Intermediate Resolution with Structure-Aware Multimodal U-Nets | CryoSAMU: enhances intermediate-resolution cryo-EM density maps of protein structures using structure-aware multimodal U-Nets. | multimodal | ✅ |  |
| 7 | ViLBench: A Suite for Vision-Language Process Reward Modeling | Proposes ViLBench for evaluating the fine-grained feedback capability of vision-language process reward models. | large language model, multimodal, chain-of-thought | ✅ |  |
| 8 | Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models | Proposes a multimodal autoregressive model to address long-text image generation. | multimodal |  |  |
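| 9 | Multimodal Image Matching based on Frequency-domain Information of Local Energy Response | Proposes FILER, a multimodal image matching method based on frequency-domain information of local energy response, addressing challenges such as nonlinear differences. | multimodal |  |  |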
| 10 | Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy | Proposes the SAFEQA model and the ESA-PO framework to mitigate hallucinations of multimodal large models in low-level visual tasks. | large language model, multimodal |  |  |
| 11 | Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering | Proposes Vision-Amplified Semantic Entropy (VASE) for hallucination detection in medical VQA. | large language model, multimodal |  |  |
| 12 | Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs | Proposes instruction-oriented preference alignment to enhance multimodal comprehension capability. | large language model, multimodal |  |  |
| 13 | Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping | Skip-Vision: accelerates vision-language models via adaptive token skipping, improving efficiency and scalability. | large language model, multimodal |  |  |
| 14 | Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency | Proposes the Free4D framework for generating 4D scenes from a single image. | foundation model |  |  |
| 15 | Dynamic Motion Blending for Versatile Motion Editing | Proposes MotionReFit, which achieves versatile text-guided motion editing through dynamic motion blending. | large language model |  |  |
| 16 | Shape Generation via Weight Space Learning | Achieves shape generation via weight space learning, exploring a new paradigm for downstream tasks of 3D generative models. | foundation model |  |  |
| 17 | MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning | Proposes MLLM-Selector, which enhances visual instruction tuning through necessity- and diversity-driven high-value data selection. | large language model |  |  |
| 18 | From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment | Proposes a long video understanding method based on visual context sampling and self-reward alignment. | large language model |  |  |
| 19 | VideoGEM: Training-free Action Grounding in Videos | Proposes VideoGEM, a training-free method for spatial action grounding in videos that outperforms existing trained methods. | foundation model |  |  |
| 20 | Wan: Open and Advanced Large-Scale Video Generative Models | Wan: open and advanced large-scale video generative models that significantly improve generation capability and efficiency. | foundation model | ✅ |  |
| 21 | Protecting Your Video Content: Disrupting Automated Video-based LLM Annotations | Proposes the Ramblings and Mutes video watermarks to counter automatic annotation by video-based LLMs. | large language model | ✅ |  |
| 22 | Faster Parameter-Efficient Tuning with Token Redundancy Reduction | Proposes FPET, which accelerates parameter-efficient tuning and lowers computational overhead through token redundancy reduction. | foundation model |  |  |
| 23 | Exploring CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation | Proposes ExCEL, which explores CLIP's dense knowledge through patch-text alignment for weakly supervised semantic segmentation. | large language model |  |  |