cs.CV (2024-12-11)

📊 21 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 2: RL & Architecture (2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

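# | Title | One-line summary | Tags | 🔗
1 | Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions | Euclid: uses high-quality synthetic visual descriptions to strengthen the geometric perception of multimodal LLMs. | large language model, multimodal
2 | CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs | CogNav: models the cognitive process with LLMs, significantly improving ObjectNav performance. | embodied AI, large language model, foundation model
3 | Multimodal Approaches to Fair Image Classification: An Ethical Perspective | Proposes a multimodal fusion approach that improves fairness in image classification and mitigates demographic bias. | multimodal
4 | Illusory VQA: Benchmarking and Enhancing Multimodal Models on Visual Illusions | Proposes Illusory VQA for evaluating and improving multimodal models on visual illusions. | multimodal
5 | LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information | LLaVA-Zip: adaptive visual token compression using intrinsic image information, improving multi-image and video processing. | large language model, instruction following
6 | Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel | Proposes the Self-Refining Data Flywheel (SRDF) for language-guided navigation learning, with performance surpassing human level. | embodied AI, VLN
7 | FILA: Fine-Grained Vision Language Models | FILA proposes HyViLM, which improves vision-language model performance on high-resolution images via a hybrid encoder and feature fusion. | large language model, multimodal
8 | Doubly-Universal Adversarial Perturbations: Deceiving Vision-Language Models Across Both Images and Text with a Single Perturbation | Proposes doubly-universal adversarial perturbations that deceive vision-language models across both images and text. | large language model, multimodal
9 | RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation | RoomTour3D: geometry-aware video-instruction tuning for embodied navigation. | VLN
10 | StreamChat: Chatting with Streaming Video | StreamChat: improves LMM interaction with streaming video by updating the visual context during decoding. | multimodal
11 | Position-aware Guided Point Cloud Completion with CLIP Model | Proposes a position-aware guided point cloud completion method that uses the CLIP model to improve completion quality. | multimodal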

🔬 Pillar 3: Perception & Semantics (5 papers)

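# | Title | One-line summary | Tags | 🔗
12 | EOV-Seg: Efficient Open-Vocabulary Panoptic Segmentation | Proposes EOV-Seg, an efficient open-vocabulary panoptic segmentation framework that significantly improves inference speed. | open-vocabulary, open vocabulary
13 | BLADE: Single-view Body Mesh Learning through Accurate Depth Estimation | BLADE: single-view human mesh learning through accurate depth estimation, improving 3D pose and 2D alignment accuracy on close-range images. | depth estimation, human mesh recovery
14 | Dense Depth from Event Focal Stack | Proposes a depth estimation method based on event-camera focal stacks, addressing the depth perception problems of conventional methods in dynamic scenes. | depth estimation
15 | Utilizing Multi-step Loss for Single Image Reflection Removal | Proposes a multi-step loss training method, combined with RefGAN-synthesized data, that effectively improves single-image reflection removal. | depth estimation
16 | Physics Based Differentiable Rendering for Inverse Problems and Beyond | Surveys physics-based differentiable rendering techniques for solving inverse problems and extending to further applications. | scene reconstruction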

🔬 Pillar 2: RL & Architecture (2 papers)

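# | Title | One-line summary | Tags | 🔗
17 | Multi-level Matching Network for Multimodal Entity Linking | Proposes the multi-level matching network M3EL to address insufficient cross-modal interaction in multimodal entity linking. | representation learning, contrastive learning, multimodal
18 | Visual Program Distillation with Template-Based Augmentation | Proposes a visual program distillation method with template-based augmentation, reducing the cost of code generation for vision tasks. | distillation, large language model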

🔬 Pillar 4: Generative Motion (2 papers)

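# | Title | One-line summary | Tags | 🔗
19 | GMem: A Modular Approach for Ultra-Efficient Generative Models | GMem: a modular approach for ultra-efficient generative models that significantly improves training and sampling efficiency. | classifier-free guidance
20 | ChatDyn: Language-Driven Multi-Actor Dynamics Generation in Street Scenes | ChatDyn: proposes a language-instruction-based system for generating multi-actor dynamics in street scenes. | physically plausible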

🔬 Pillar 1: Robot Control (1 paper)

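# | Title | One-line summary | Tags | 🔗
21 | Physics Context Builders: A Modular Framework for Physical Reasoning in Vision-Language Models | Proposes Physics Context Builders (PCBs) to improve vision-language model performance on physical reasoning tasks. | sim2real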
