| 1 |
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies |
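Proposes CF-VLA, which improves the efficiency of vision-language-action policies through a two-stage, coarse-to-fine action generation method. |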
embodied AI vision-language-action VLA |
✅ |
|
| 2 |
QEVA: A Reference-Free Evaluation Metric for Narrative Video Summarization with Multimodal Question Answering |
Proposes QEVA, a reference-free evaluation metric for narrative video summarization based on multimodal question answering. |
large language model multimodal |
|
|
| 3 |
Hierarchical Prototype-based Domain Priors for Multiple Instance Learning in Multimodal Histopathology Analysis |
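Proposes the HPDP framework, which uses hierarchical prototypes and domain priors to improve MIL analysis of multimodal pathology images. |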
large language model multimodal |
|
|
| 4 |
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation |
Tuna-2: pixel embeddings outperform vision encoders for multimodal understanding and generation. |
multimodal |
|
|
| 5 |
Benchmarking Pathology Foundation Models for Breast Cancer Survival Prediction |
Benchmarks pretrained pathology models at large scale on breast cancer survival prediction. |
foundation model |
|
|
| 6 |
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation |
Proposes the Positive-and-Negative Decoding framework to mitigate object hallucination in vision-language models. |
visual grounding |
|
|
| 7 |
EXACT: an explainable anomaly-aware vision foundation model for analysis of 3D chest CT |
EXACT: an explainable, anomaly-aware vision foundation model for 3D chest CT analysis. |
foundation model |
|
|
| 8 |
Robust Grounding with MLLMs against Occlusion and Small Objects via Language-guided Semantic Cues |
Proposes language-guided semantic cues to improve MLLM robustness in scenes with occlusion and small objects. |
large language model multimodal |
|
|
| 9 |
NeuroClaw Technical Report |
NeuroClaw: a domain-specific multi-agent research assistant for executable and reproducible neuroimaging research. |
multimodal |
✅ |
|
| 10 |
Meta-CoT: Enhancing Granularity and Generalization in Image Editing |
Meta-CoT: enhancing image editing through finer granularity and better generalization. |
chain-of-thought |
✅ |
|
| 11 |
Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift |
Proposes MG-MTTA to address test-time adaptation of vision-language models under modality-specific shift. |
multimodal |
|
|
| 12 |
Zero-to-CAD: Agentic Synthesis of Interpretable CAD Programs at Million-Scale Without Real Data |
Zero-to-CAD: synthesizes interpretable CAD programs at million scale without real data. |
large language model |
|
|
| 13 |
Don't Pause! Every prediction matters in a streaming video |
Proposes SPOT-Bench to evaluate the real-time behavior of streaming video understanding models, and AsynKV to improve their performance. |
TAMP |
|
|
| 14 |
SMoES: Soft Modality-Guided Expert Specialization in MoE-VLMs |
Proposes SMoES, which improves the performance and efficiency of MoE-VLMs through modality-guided expert specialization. |
multimodal |
|
|
| 15 |
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models |
LearnPruner: rethinking attention-based token pruning in vision-language models. |
large language model |
|
|
| 16 |
GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction |
Proposes GoClick, a lightweight GUI element grounding model for autonomous GUI interaction on resource-constrained devices. |
visual grounding |
|
|