cs.CV (2024-07-25)

📊 24 papers in total | 🔗 6 with code

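🎯 Interest Area Navigation

Pillar 9: Embodied Large Models (Embodied Foundation Models) (12 🔗4) · Pillar 2: RL Algorithms and Architecture (RL & Architecture) (6 🔗1) · Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (3) · Pillar 1: Robot Control (2 🔗1) · Pillar 6: Video Extraction and Matching (Video Extraction) (1)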

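🔬 Pillar 9: Embodied Large Models (Embodied Foundation Models) (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models	Proposes RestoreAgent, which uses multimodal large language models for autonomous image restoration and addresses complex degradations.	large language model multimodal
2	Efficient Inference of Vision Instruction-Following Models with Elastic Cache	Proposes Elastic Cache to accelerate inference of vision instruction-following models and reduce KV cache memory requirements.	multimodal instruction following	✅
3	Retinal IPA: Iterative KeyPoints Alignment for Multimodal Retinal Imaging	Proposes Retinal IPA, a keypoint alignment method for multimodal retinal image registration.	multimodal	✅
4	Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis	Studies the robustness of multimodal models under sparse and contiguous adversarial pixel perturbations.	multimodal
5	KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models	Proposes the KiVA benchmark for testing the visual analogical reasoning of large multimodal models.	multimodal
6	ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation	ERIT: a lightweight multimodal dataset for elderly emotion recognition and multimodal fusion evaluation.	multimodal
7	Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning	Proposes a Bottleneck Adapter to improve the performance of vision-language instruction-tuned models.	large language model multimodal
8	RefMask3D: Language-Guided Transformer for 3D Referring Segmentation	RefMask3D: a language-guided Transformer network for 3D referring expression segmentation.	visual grounding	✅
9	MARINE: A Computer Vision Model for Detecting Rare Predator-Prey Interactions in Animal Videos	MARINE: a computer vision model for detecting rare predator-prey interactions in animal videos.	foundation model
10	Unified Lexical Representation for Interpretable Visual-Language Alignment	Proposes LexVLA, which achieves interpretable visual-language alignment through a unified lexical representation.	VLA	✅
11	A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models	Proposes the feature-guided attack FGA and its improved variant FGA-T for evaluating and improving the robustness of vision-language pre-training models.	multimodal
12	DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction	Proposes the DAC framework, which tackles 2D-3D cross-modal retrieval with noisy labels through divide-and-conquer alignment and correction.	multimodal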

🔬 Pillar 2: RL Algorithms and Architecture (RL & Architecture) (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
13	Leveraging Foundation Models via Knowledge Distillation in Multi-Object Tracking: Distilling DINOv2 Features to FairMOT	Uses knowledge distillation to transfer DINOv2 features to FairMOT, improving multi-object tracking performance.	teacher-student distillation foundation model
14	$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs	Proposes the $\mathbb{X}$-Sample contrastive loss to improve contrastive learning.	contrastive learning foundation model multimodal
15	HVM-1: Large-scale video models pretrained with nearly 5000 hours of human-like video data	Proposes HVM-1, large-scale video models pretrained on nearly 5000 hours of human-like video data, improving video and image recognition.	masked autoencoder MAE egocentric
16	PianoMime: Learning a Generalist, Dexterous Piano Player from Internet Demonstrations	PianoMime: learns a generalist piano-playing robot from internet videos.	policy learning distillation generalist agent
17	ALMRR: Anomaly Localization Mamba on Industrial Textured Surface with Feature Reconstruction and Refinement	Proposes ALMRR, a Mamba-based model for unsupervised anomaly localization of defects on industrial textured surfaces.	Mamba
18	Harnessing Temporal Causality for Advanced Temporal Action Detection	CausalTAD: exploits temporal causality to improve temporal action detection performance.	Mamba Ego4D	✅

🔬 Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
19	GaussianSR: High Fidelity 2D Gaussian Splatting for Arbitrary-Scale Image Super-Resolution	Proposes GaussianSR, which uses 2D Gaussian splatting for arbitrary-scale image super-resolution.	gaussian splatting splatting
20	BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation	BetterDepth: a plug-and-play diffusion refiner for zero-shot monocular depth estimation.	depth estimation monocular depth
21	UMono: Physical Model Informed Hybrid CNN-Transformer Framework for Underwater Monocular Depth Estimation	UMono: a physical-model-informed hybrid CNN-Transformer framework for underwater monocular depth estimation.	depth estimation monocular depth

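🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	Move and Act: Enhanced Object Manipulation and Background Integrity for Image Editing	Proposes Move and Act, enabling image editing with controllable object manipulation and improved background integrity.	manipulation	✅
23	DragText: Rethinking Text Embedding in Point-based Image Editing	DragText: improves point-based image editing by optimizing the text embedding.	manipulation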

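🔬 Pillar 6: Video Extraction and Matching (Video Extraction) (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
24	AttentionHand: Text-driven Controllable Hand Image Generation for 3D Hand Reconstruction in the Wild	AttentionHand: proposes a text-driven controllable hand image generation method to improve 3D hand reconstruction in the wild.	hand reconstruction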
