| 1 |
Test-Time Computing for Referring Multimodal Large Language Models |
Proposes ControlMLLM++, which uses test-time computing to enable region-level visual reasoning in referring MLLMs. |
large language model multimodal |
✅ |
|
| 2 |
Universal Pose Pretraining for Generalizable Vision-Language-Action Policies |
Proposes Pose-VLA, which decouples perception from action alignment in vision-language-action models to improve generalization. |
vision-language-action VLA |
|
|
| 3 |
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models |
MICON-Bench: benchmarks and enhances multi-image context image generation in unified multimodal models. |
large language model multimodal |
✅ |
|
| 4 |
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device |
Proposes Mobile-O, a compact model for unified multimodal understanding and generation on mobile devices. |
multimodal |
✅ |
|
| 5 |
Do Large Language Models Understand Data Visualization Rules? |
Evaluates how well large language models understand data visualization rules and explores their potential as rule validators. |
large language model |
|
|
| 6 |
StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues |
StructXLIP: enhances vision-language models with multimodal structural cues to improve cross-modal retrieval performance. |
multimodal |
✅ |
|
| 7 |
Do Large Language Models Understand Data Visualization Principles? |
Evaluates how well large language models understand data visualization principles and explores their use in chart validation and repair. |
large language model |
|
|
| 8 |
Closing the gap in multimodal medical representation alignment |
Proposes a modality-agnostic framework that closes the modality gap in multimodal medical representation alignment. |
multimodal |
|
|
| 9 |
CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning |
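Proposes the cross-level collaborative representation (CLCR) method to address semantic misalignment and error propagation in multimodal learning. |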
multimodal |
|
|
| 10 |
Vinedresser3D: Agentic Text-guided 3D Editing |
Vinedresser3D: proposes an agent-based framework for text-guided 3D editing that enables high-quality, precise modification of 3D assets. |
large language model multimodal |
|
|
| 11 |
ApET: Approximation-Error Guided Token Compression for Efficient VLMs |
ApET: improves the efficiency of vision-language models through approximation-error-guided token compression. |
multimodal |
✅ |
|
| 12 |
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness |
Proposes plug-and-play modules that improve the reasoning of vision-language models on rare objects. |
foundation model |
|
|
| 13 |
CountEx: Fine-Grained Counting via Exemplars and Exclusion |
CountEx: fine-grained counting via exemplars and exclusion, addressing the tendency of existing methods to confuse similar objects. |
multimodal |
✅ |
|
| 14 |
PA-Attack: Guiding Gray-Box Attacks on LVLM Vision Encoders with Prototypes and Attention |
Proposes PA-Attack, which strengthens gray-box attacks on LVLM vision encoders through prototype guidance and attention mechanisms. |
multimodal |
✅ |
|