cs.CV (2024-07-25)

📊 24 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (12 🔗4) · Pillar 2: RL & Architecture (6 🔗1) · Pillar 3: Perception & Semantics (3) · Pillar 1: Robot Control (2 🔗1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (12 papers)

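# | Title | One-line summary | Tags | 🔗
1 | RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models | Proposes RestoreAgent, which uses multimodal large language models to perform autonomous image restoration and handle complex degradations. | large language model multimodal
2 | Efficient Inference of Vision Instruction-Following Models with Elastic Cache | Proposes Elastic Cache to accelerate inference of vision instruction-following models and reduce KV cache memory requirements. | multimodal instruction following
3 | Retinal IPA: Iterative KeyPoints Alignment for Multimodal Retinal Imaging | Proposes Retinal IPA, a keypoint alignment method for multimodal retinal image registration. | multimodal
4 | Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis | Studies the robustness of multimodal models under sparse versus contiguous adversarial pixel perturbations. | multimodal
5 | KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models | Proposes the KiVA benchmark to test the visual analogical reasoning of large multimodal models. | multimodal
6 | ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation | ERIT: a lightweight multimodal dataset for elderly emotion recognition and multimodal fusion evaluation. | multimodal
7 | Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning | Proposes a Bottleneck Adapter to improve the performance of vision-language instruction-tuned models. | large language model multimodal
8 | RefMask3D: Language-Guided Transformer for 3D Referring Segmentation | RefMask3D: a language-guided Transformer network for 3D referring expression segmentation. | visual grounding
9 | MARINE: A Computer Vision Model for Detecting Rare Predator-Prey Interactions in Animal Videos | MARINE: a computer vision model for detecting rare predator-prey interactions in animal videos. | foundation model
10 | Unified Lexical Representation for Interpretable Visual-Language Alignment | Proposes LexVLA, which achieves interpretable vision-language alignment through a unified lexical representation. | VLA
11 | A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models | Proposes the feature-guided attack FGA and its improved variant FGA-T for evaluating and improving the robustness of vision-language pre-training models. | multimodal
12 | DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction | Proposes the DAC framework, which addresses 2D-3D cross-modal retrieval with noisy labels through divide-and-conquer alignment and correction. | multimodal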

🔬 Pillar 2: RL & Architecture (6 papers)

# | Title | One-line summary | Tags | 🔗
13 | Leveraging Foundation Models via Knowledge Distillation in Multi-Object Tracking: Distilling DINOv2 Features to FairMOT | Uses knowledge distillation to transfer DINOv2 features to FairMOT, improving multi-object tracking performance. | teacher-student distillation foundation model
14 | $\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs | Proposes the $\mathbb{X}$-Sample contrastive loss to improve contrastive learning. | contrastive learning foundation model multimodal
15 | HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data | Proposes HVM-1, large-scale video models pretrained on nearly 5000 hours of human-like video data, improving video and image recognition. | masked autoencoder MAE egocentric
16 | PianoMime: Learning a Generalist, Dexterous Piano Player from Internet Demonstrations | PianoMime: learns a generalist piano-playing robot from internet videos. | policy learning distillation generalist agent
17 | ALMRR: Anomaly Localization Mamba on Industrial Textured Surface with Feature Reconstruction and Refinement | Proposes ALMRR, a Mamba-based model for unsupervised anomaly localization of defects on industrial textured surfaces. | Mamba
18 | Harnessing Temporal Causality for Advanced Temporal Action Detection | CausalTAD: exploits temporal causality to improve temporal action detection performance. | Mamba Ego4D

🔬 Pillar 3: Perception & Semantics (3 papers)

# | Title | One-line summary | Tags | 🔗
19 | GaussianSR: High Fidelity 2D Gaussian Splatting for Arbitrary-Scale Image Super-Resolution | Proposes GaussianSR, which uses 2D Gaussian splatting for arbitrary-scale image super-resolution. | gaussian splatting splatting
20 | BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation | BetterDepth: a plug-and-play diffusion refiner for zero-shot monocular depth estimation. | depth estimation monocular depth
21 | UMono: Physical Model Informed Hybrid CNN-Transformer Framework for Underwater Monocular Depth Estimation | UMono: a physical-model-informed hybrid CNN-Transformer framework for underwater monocular depth estimation. | depth estimation monocular depth

🔬 Pillar 1: Robot Control (2 papers)

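# | Title | One-line summary | Tags | 🔗
22 | Move and Act: Enhanced Object Manipulation and Background Integrity for Image Editing | Proposes Move and Act, enabling image editing with controllable object manipulation and improved background integrity. | manipulation
23 | DragText: Rethinking Text Embedding in Point-based Image Editing | DragText: improves point-based image editing by optimizing the text embedding. | manipulation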

🔬 Pillar 6: Video Extraction (1 paper)

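# | Title | One-line summary | Tags | 🔗
24 | AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild | AttentionHand: proposes a text-driven controllable hand image generation method to improve 3D hand reconstruction in the wild. | hand reconstruction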
