| # | Title | Summary | Keywords | Read |
|---|-------|---------|----------|------|
| 1 | VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | VisionLLM v2: a generalist multimodal large language model that unifies visual perception, understanding, and generation tasks. | large language model, multimodal | |
| 2 | OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text | OmniCorpus: a large-scale multimodal dataset of ten-billion-level images interleaved with text, advancing multimodal large language models. | large language model, multimodal | ✅ |
| 3 | Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | SliME: improves large multimodal model performance on high-resolution images via local compression and a global mixture of experts. | multimodal | |
| 4 | LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions | An LLM-assisted concept discovery method that automatically identifies and explains the functions of neurons in neural networks. | large language model, multimodal | |
| 5 | Real2Code: Reconstruct Articulated Objects via Code Generation | Real2Code: reconstructs articulated objects via code generation, overcoming limits on object complexity and real-world scenes. | large language model | |
| 6 | GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices | GUIOdyssey: a dataset for improving the performance of cross-app GUI navigation agents on mobile devices. | multimodal | |
| 7 | APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation | APSeg: an auto-prompt network for cross-domain few-shot semantic segmentation. | foundation model | |
| 8 | Refusal as Silence: Gendered Disparities in Vision-Language Model Responses | Reveals gender disparities in the refusal behavior of vision-language models via gendered identity prompts. | large language model | |