| 1 |
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models |
Proposes the MMIU benchmark for evaluating the multi-image understanding capabilities of large vision-language models. |
multimodal |
|
|
| 2 |
Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection |
Proposes a multi-level cross-modal semantic incongruity representation, enhanced with image captioning, for multimodal sarcasm detection. |
multimodal |
|
|
| 3 |
Target-Dependent Multimodal Sentiment Analysis Via Employing Visual-to Emotional-Caption Translation Network using Visual-Caption Pairs |
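Proposes the VECTN model, which enhances target-dependent multimodal sentiment analysis through visual-to-emotional-caption translation. |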
multimodal |
|
|
| 4 |
Geometric Algebra Meets Large Language Models: Instruction-Based Transformations of Separate Meshes in 3D, Interactive and Controllable Scenes |
Proposes Shenlong, which combines an LLM with CGA to achieve precise, controllable object repositioning in interactive 3D scenes. |
large language model |
|
|
| 5 |
Fairness and Bias Mitigation in Computer Vision: A Survey |
A survey of fairness and bias mitigation in computer vision: summarizes existing methods and looks ahead to future trends. |
multimodal |
|
|
| 6 |
Infusing Environmental Captions for Long-Form Video Language Grounding |
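Proposes EI-VLG, which uses environmental captions to enhance long-form video language grounding and effectively exclude irrelevant frames. |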
large language model |
|
|
| 7 |
Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets |
Uses OWLv2 for zero-shot detection of motorcycles, passengers, and helmet use, in support of traffic safety. |
foundation model |
|
|
| 8 |
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning |
ExoViP: uses exoskeleton modules for step-by-step verification and exploration to improve compositional visual reasoning. |
large language model |
|
|