| 1 |
PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? |
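PixFoundation: Reveals the limitations of pixel-level vision foundation models in visual question answering and grounding, and explores the potential of MLLMs trained without pixel-level supervision. |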
large language model foundation model |
|
|
| 2 |
LeAP: Consistent multi-domain 3D labeling using Foundation Models |
LeAP: Uses foundation models for consistent automatic labeling of 3D point clouds across multiple domains |
foundation model |
|
|
| 3 |
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs |
Proposes the WorldSense benchmark for evaluating the omnimodal understanding of multimodal LLMs in real-world scenarios. |
multimodal |
|
|
| 4 |
A Self-supervised Multimodal Deep Learning Approach to Differentiate Post-radiotherapy Progression from Pseudoprogression in Glioblastoma |
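Proposes a self-supervised multimodal deep learning method to differentiate true progression from pseudoprogression after radiotherapy in glioblastoma. |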
multimodal |
|
|
| 5 |
LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models |
LR0.FM: Improves the zero-shot classification robustness of vision-language foundation models on low-resolution images |
foundation model |
✅ |
|
| 6 |
No Free Lunch in Annotation either: An objective evaluation of foundation models for streamlining annotation in animal tracking |
For animal tracking, the paper objectively evaluates how effective foundation models are at streamlining annotation. |
foundation model |
|
|
| 7 |
FairT2I: Mitigating Social Bias in Text-to-Image Generation via Large Language Model-Assisted Detection and Attribute Rebalancing |
FairT2I mitigates social bias in text-to-image generation through large language model-assisted detection and attribute rebalancing. |
large language model |
|
|
| 8 |
Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting |
Proposes Time-VLM, which uses multimodal vision-language models to augment time series forecasting. |
multimodal |
✅ |
|
| 9 |
Color in Visual-Language Models: CLIP deficiencies |
Reveals CLIP's deficiencies in color understanding: a bias on achromatic stimuli and a tendency to prioritize text |
multimodal |
|
|
| 10 |
Ola: Pushing the Frontiers of Omni-Modal Language Model |
Ola: An omni-modal language model that achieves performance comparable to specialized models in image, video, and audio understanding. |
large language model |
✅ |
|
| 11 |
Keep It Light! Simplifying Image Clustering Via Text-Free Adapters |
SCP: Simplifies image clustering with text-free adapters, achieving performance comparable to SOTA |
large language model |
|
|
| 12 |
CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing |
Proposes the CAD-Editor framework, which enables text-driven CAD model editing through automated data synthesis and a locate-then-infill strategy. |
large language model |
✅ |
|
| 13 |
RWKV-UI: UI Understanding with Enhanced Perception and Reasoning |
Proposes RWKV-UI, which improves the performance of vision-language models on UI understanding and interactive reasoning |
chain-of-thought |
|
|