| 1 |
MIMO: A medical vision language model with visual referring multimodal input and pixel grounding multimodal output |
MIMO: a medical vision-language model with visual-referring multimodal input and pixel-grounding multimodal output |
multimodal instruction following |
|
|
| 2 |
Vision4PPG: Emergent PPG Analysis Capability of Vision Foundation Models for Vital Signs like Blood Pressure |
Vision4PPG: applies vision foundation models to PPG analysis to predict vital signs such as blood pressure |
foundation model |
|
|
| 3 |
ESCA: Contextualizing Embodied Agents via Scene-Graph Generation |
Proposes the ESCA framework, which strengthens the contextual awareness of embodied agents through scene-graph generation |
large language model foundation model |
✅ |
|
| 4 |
From Generic to Specialized: A Subspecialty Diagnostic System Powered by Self-Supervised Learning for Cervical Histopathology |
CerS-Path: a subspecialty diagnostic system for cervical histopathology based on self-supervised learning |
foundation model multimodal |
|
|
| 5 |
CoIDO: Efficient Data Selection for Visual Instruction Tuning via Coupled Importance-Diversity Optimization |
CoIDO: efficient data selection for visual instruction tuning via coupled importance-diversity optimization |
large language model multimodal |
|
|
| 6 |
Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning |
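Proposes Q-Adapter, which uses learnable query tokens to efficiently extract caption-relevant visual features, enabling parameter-efficient video captioning. |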
large language model multimodal |
|
|
| 7 |
Scaling Traffic Insights with AI and Language Model-Powered Camera Systems for Data-Driven Transportation Decision Making |
Proposes an AI- and language-model-based traffic camera system for large-scale traffic insights and data-driven decision making |
large language model multimodal |
|
|
| 8 |
EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection |
EditCast3D: single-frame-guided 3D editing using video propagation and view selection |
foundation model |
✅ |
|
| 9 |
From Programs to Poses: Factored Real-World Scene Generation via Learned Program Libraries |
FactoredScenes: generates factored real-world scenes through learned program libraries, addressing data scarcity. |
large language model |
|
|