| 1 |
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions |
Euclid: Enhancing the geometric perception of multimodal LLMs with high-quality synthetic visual descriptions |
large language model multimodal |
|
|
| 2 |
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs |
CogNav: Models the cognitive process with LLMs, significantly improving performance on the ObjectNav task |
embodied AI large language model foundation model |
|
|
| 3 |
Multimodal Approaches to Fair Image Classification: An Ethical Perspective |
Proposes a multimodal fusion approach that improves fairness in image classification and mitigates demographic bias. |
multimodal |
|
|
| 4 |
Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions |
Proposes Illusory VQA for evaluating and improving the performance of multimodal models on visual illusions. |
multimodal |
|
|
| 5 |
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information |
LLaVA-Zip: Adaptive visual token compression using intrinsic image information, improving multi-image and video processing |
large language model instruction following |
|
|
| 6 |
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel |
Proposes a Self-Refining Data Flywheel (SRDF) for language-guided navigation learning, with performance surpassing human level. |
embodied AI VLN |
|
|
| 7 |
FILA: Fine-Grained Vision Language Models |
FILA proposes HyViLM, which uses a hybrid encoder and feature fusion to improve vision-language model performance on high-resolution images |
large language model multimodal |
|
|
| 8 |
Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation |
Proposes doubly-universal adversarial perturbations that deceive vision-language models across both images and text |
large language model multimodal |
|
|
| 9 |
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation |
RoomTour3D: Geometry-aware video-instruction tuning for embodied navigation |
VLN |
|
|
| 10 |
StreamChat: Chatting with Streaming Video |
StreamChat: Improves how LMMs interact with streaming video by updating the visual context during decoding |
multimodal |
|
|
| 11 |
Position-aware Guided Point Cloud Completion with CLIP Model |
Proposes a position-aware guided point cloud completion method that uses the CLIP model to improve completion quality |
multimodal |
|
|