| # | Title | Summary | Tags | ✅ |
|---|-------|---------|------|----|
| 1 | ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding | ReFocus achieves structured image understanding by treating visual editing as a chain of thought. | large language model, multimodal, chain-of-thought | |
| 2 | Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark | Proposes EMMA, an enhanced multimodal reasoning benchmark for evaluating MLLMs on complex cross-modal reasoning. | large language model, multimodal, chain-of-thought | |
| 3 | Atlas: A Novel Pathology Foundation Model by Mayo Clinic, Charité, and Aignostics | Atlas: a novel pathology foundation model jointly developed by Mayo Clinic, Charité, and Aignostics. | foundation model | |
| 4 | CellViT++: Energy-Efficient and Adaptive Cell Segmentation and Classification Using Foundation Models | Proposes CellViT++ for cell segmentation and classification in digital pathology. | foundation model | ✅ |
| 5 | LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding | LLaVA-Octopus: instruction-driven adaptive projector fusion for video understanding. | large language model, multimodal | |
| 6 | V2C-CBM: Building Concept Bottlenecks with Vision-to-Concept Tokenizer | Proposes V2C-CBM, which builds efficient and interpretable concept bottleneck models with a vision-to-concept tokenizer. | large language model, multimodal | |
| 7 | OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? | OVO-Bench: an online video understanding benchmark that evaluates the temporal awareness of video LLMs. | TAMP | ✅ |
| 8 | Comparison Study: Glacier Calving Front Delineation in Synthetic Aperture Radar Images With Deep Learning | Compares deep learning models for glacier calving front delineation in SAR images, revealing the gap between model outputs and human annotations. | foundation model | |
| 9 | Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection | Harnesses large language models and vision-language models to strengthen robust out-of-distribution detection. | large language model | |
| 10 | A Flexible and Scalable Framework for Video Moment Search | Proposes the SPR framework for efficient and flexible ranked video moment retrieval in long videos. | TAMP | |
| 11 | Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments | Proposes the SwPC framework, which uses conformal prediction to improve the confidence and accuracy of VLMs in robotic scene recognition. | large language model | |