| 1 |
Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models |
Reveals the trade-off between spatial costs and temporal gains of video fine-tuning in multimodal large language models |
large language model multimodal |
|
|
| 2 |
MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval |
Proposes MCoT-MVS, which uses multimodal CoT reasoning to achieve precise vision selection in composed image retrieval. |
large language model multimodal chain-of-thought |
|
|
| 3 |
Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation |
Proposes MoBaNet to address modality imbalance in multimodal remote sensing semantic segmentation |
foundation model multimodal |
✅ |
|
| 4 |
EI: Early Intervention for Multimodal Imaging based Disease Recognition |
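Proposes the EI framework, which improves disease recognition accuracy on multimodal medical imaging through early intervention and MoR adaptation. |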
foundation model multimodal |
|
|
| 5 |
Revisiting foundation models for cell instance segmentation |
For cell instance segmentation, the paper evaluates and improves several SAM-based foundation models |
foundation model |
✅ |
|
| 6 |
UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models |
UniSAFE: a comprehensive benchmark for safety evaluation of unified multimodal models |
multimodal |
✅ |
|
| 7 |
Harnessing the Power of Foundation Models for Accurate Material Classification |
Proposes a material classification framework that leverages foundation models, addressing data scarcity and improving classification accuracy. |
foundation model |
|
|
| 8 |
A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition |
Proposes QGN, a proposal-free query-guided network, to address the mismatch between detectors and entities in GMNER. |
multimodal |
|
|
| 9 |
Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation |
Proposes the Concept-to-Pixel framework to address automation and robustness in medical image segmentation |
large language model multimodal |
✅ |
|
| 10 |
FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions |
FineViT: unlocks fine-grained perception through dense recaptions, improving vision encoder performance |
large language model multimodal |
|
|
| 11 |
LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis |
Proposes the LED benchmark for evaluating structural reasoning in layout error detection for document analysis. |
large language model multimodal |
|
|
| 12 |
From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs |
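Reveals the mechanism of image segmentation in MLLMs by analyzing the interactions among the vision encoder, the adapter, and the LLM layers |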
large language model multimodal |
|
|
| 13 |
The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering |
Proposes a training-free text embedding interpolation method for continuous control over text-conditioned image generation. |
large language model |
|
|
| 14 |
VideoAtlas: Navigating Long-Form Video in Logarithmic Compute |
Proposes VideoAtlas for navigating and understanding long videos with logarithmic compute. |
multimodal |
|
|
| 15 |
Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning |
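Proposes the C-TRAIL dataset and a multi-agent legal reasoning framework to automatically determine traffic accident responsibility from dashcam video |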
multimodal |
|
|
| 16 |
Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations? |
Proposes EditSpilloverProbe for evaluating image editing models' implicit understanding of world relations |
instruction following |
|
|
| 17 |
Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients |
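Proposes a fine-grained post-training quantization method based on quantization-aware integrated gradients, improving quantization performance for large vision-language models. |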
multimodal |
✅ |
|
| 18 |
Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3) |
Evaluates SAM3 on eye image segmentation tasks and compares it with SAM2. |
foundation model |
|
|
| 19 |
Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation |
Proposes the Omni-I2C benchmark for evaluating large models' ability to convert images into executable code |
multimodal |
✅ |
|