| 1 |
CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models |
CLASP: class-adaptive layer fusion and dual-stage pruning for multimodal large language models |
large language model multimodal |
✅ |
|
| 2 |
All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding |
Proposes a unified synthetic data pipeline to address data scarcity in multimodal video understanding |
large language model multimodal visual grounding |
|
|
| 3 |
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models |
Proposes Chain-of-Models Pre-Training (CoM-PT), which accelerates vision foundation model training without performance loss |
large language model foundation model |
|
|
| 4 |
Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation |
Proposes P-FIN to address missing features and uncertainty in multimodal federated learning, improving the safety of medical diagnosis |
multimodal |
|
|
| 5 |
Towards Long-horizon Agentic Multimodal Search |
Proposes LMM-Searcher to tackle long-horizon multimodal search |
multimodal |
✅ |
|
| 6 |
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks |
Proposes MSLA, a multimodal semantic lighting attack that challenges the safety of vision-language models in the physical world |
multimodal |
|
|
| 7 |
AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition |
Proposes AffectAgent to resolve modality ambiguity in multimodal emotion recognition |
multimodal |
✅ |
|
| 8 |
Brain-DiT: A Universal Multi-state fMRI Foundation Model with Metadata-Conditioned Pretraining |
Brain-DiT: a universal multi-state fMRI foundation model based on metadata-conditioned diffusion pretraining |
foundation model |
✅ |
|
| 9 |
GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning |
GeoAlign improves the spatial reasoning of MLLMs through geometric feature realignment |
large language model foundation model multimodal |
|
|
| 10 |
Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models |
Proposes Decoder-side Temporal Rebalancing (DTR) to mitigate hallucinations in video large language models |
large language model |
|
|
| 11 |
MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models |
MODIX: a training-free, multimodal information-driven positional index scaling method that improves vision-language model performance |
multimodal |
|
|
| 12 |
Boosting Visual Instruction Tuning with Self-Supervised Guidance |
Proposes V-GIFT, which boosts visual instruction tuning with self-supervised guidance and strengthens the visual reasoning of MLLMs |
large language model multimodal |
✅ |
|
| 13 |
Distorted or Fabricated? A Survey on Hallucination in Video LLMs |
A comprehensive survey of hallucination in video large language models, with a systematic taxonomy and mitigation strategies |
large language model visual grounding |
✅ |
|
| 14 |
DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment |
Proposes DPC-VQA, which decouples quality perception from residual calibration for efficient video quality assessment |
large language model multimodal |
|
|
| 15 |
NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1) |
NTIRE 2026 RAIM Challenge: exploring the use of MLLMs for professional image quality assessment |
large language model multimodal |
✅ |
|
| 16 |
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding |
Proposes the DSTP framework to address the performance drop of visual token pruning on complex reasoning tasks during MLLM decoding |
large language model multimodal |
|
|
| 17 |
Agentic Discovery with Active Hypothesis Exploration for Visual Recognition |
HypoExplore: an agentic framework for visual recognition architecture discovery based on active hypothesis exploration |
large language model |
|
|
| 18 |
Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs |
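Proposes perception programs (P²), which improve the visual tool reasoning of multimodal large language models through language-native cues |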
multimodal |
|
|
| 19 |
OmniFood8K: Single-Image Nutrition Estimation via Hierarchical Frequency-Aligned Fusion |
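Proposes the OmniFood8K dataset and a single-image nutrition estimation framework to address the difficulty of nutrition estimation for Chinese cuisine |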
multimodal |
✅ |
|
| 20 |
Combating Pattern and Content Bias: Adversarial Feature Learning for Generalized AI-Generated Image Detection |
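Proposes a multi-dimensional adversarial feature learning framework that improves the generalization of AI-generated image detection |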
multimodal |
|
|
| 21 |
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors |
Uses 3D foundation priors to generate realistic and consistent orbital videos of objects |
foundation model |
|
|
| 22 |
Boosting Robust AIGI Detection with LoRA-based Pairwise Training |
Proposes LPT, a LoRA-based pairwise training method that improves robust detection of AIGI images under complex distortions |
foundation model |
|
|
| 23 |
Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment |
Proposes the DS-IEQA framework to address rigid evaluation criteria and distance-agnostic score modeling in image editing quality assessment |
multimodal |
|
|