| 1 |
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification |
CogVLA: a cognition-aligned vision-language-action model built on instruction-driven routing and sparsification |
vision-language-action VLA multimodal |
✅ |
|
| 2 |
Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation |
Dino U-Net: improves medical image segmentation accuracy using high-fidelity dense features from DINOv3 |
foundation model |
✅ |
|
| 3 |
PathMR: Multimodal Visual Reasoning for Interpretable Pathology Diagnosis |
Proposes PathMR, a multimodal visual reasoning framework for interpretable pathology diagnosis |
multimodal |
✅ |
|
| 4 |
Adapting Foundation Model for Dental Caries Detection with Dual-View Co-Training |
Proposes DVCTNet, which uses dual-view co-training to improve dental caries detection accuracy |
foundation model |
✅ |
|
| 5 |
Graph-Based Uncertainty Modeling and Multimodal Fusion for Salient Object Detection |
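Proposes a salient object detection network with graph-based uncertainty modeling and multimodal fusion, improving detection accuracy in complex scenes |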
multimodal |
✅ |
|
| 6 |
MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models |
MedFoundationHub: a lightweight, secure toolkit for deploying medical vision-language models that addresses the risk of PHI exposure |
foundation model |
|
|
| 7 |
Veritas: Generalizable Deepfake Detection via Pattern-Aware Reasoning |
Proposes Veritas, which uses pattern-aware reasoning to make deepfake detection generalizable, evaluated on the HydraFake dataset |
large language model chain-of-thought |
|
|
| 8 |
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning |
R-4B: incentivizes general-purpose auto-thinking capability in MLLMs through bi-mode annealing and reinforcement learning |
large language model multimodal |
|
|
| 9 |
GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions |
GENNAV: polygon mask generation for generalized referring navigable regions |
zero-shot transfer |
|
|
| 10 |
Generalizable Object Re-Identification via Visual In-Context Prompting |
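Proposes a general object re-identification method based on visual in-context prompting that requires no category-specific training |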
foundation model |
✅ |
|
| 11 |
MMG-Vid: Maximizing Marginal Gains at Segment-level and Token-level for Efficient Video LLMs |
MMG-Vid: improves video LLM efficiency by maximizing marginal gains at the segment level and the token level |
large language model |
|
|
| 12 |
Looking Beyond the Obvious: A Survey on Abstract Concept Recognition for Video Understanding |
Survey: abstract concept recognition in video, using foundation models to advance video understanding |
foundation model |
|
|
| 13 |
Improving Alignment in LVLMs with Debiased Self-Judgment |
Proposes an LVLM alignment method based on debiased self-judgment, improving the safety and accuracy of vision-language models |
large language model |
|
|