cs.CV (2024-09-18)

📊 26 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (10 🔗6) · Pillar 9: Embodied Foundation Models (6 🔗1) · Pillar 2: RL & Architecture (5 🔗1) · Pillar 1: Robot Control (2) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1) · Pillar 6: Video Extraction (1 🔗1)

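🔬 Pillar 3: Perception & Semantics (10 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Depth Estimation Based on 3D Gaussian Splatting Siamese Defocus	Proposes a self-supervised depth-from-defocus estimation framework based on 3D Gaussian Splatting and a Siamese network.	depth estimation, monocular depth, stereo depth
2	Gradient-Driven 3D Segmentation and Affordance Transfer in Gaussian Splatting Using 2D Masks	Proposes a gradient-driven method for 3D Gaussian segmentation and affordance transfer, improving 3D scene understanding.	3D gaussian splatting, 3DGS, gaussian splatting	✅
3	BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling	Proposes BRDF-NeRF to address BRDF modelling in satellite imagery.	NeRF, neural radiance field
4	Optical Flow Matters: an Empirical Comparative Study on Fusing Monocular Extracted Modalities for Better Steering	Proposes an end-to-end steering prediction method for autonomous driving that fuses modalities extracted from monocular images, significantly improving steering accuracy.	optical flow, multimodal
5	LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Models for Referring Expression Comprehension	Proposes LLM-wrapper, which uses a large language model for black-box adaptation of vision-language models, improving referring expression comprehension performance.	open-vocabulary, open vocabulary, large language model	✅
6	ORB-SfMLearner: ORB-Guided Self-supervised Visual Odometry with Selective Online Adaptation	Proposes an ORB-guided self-supervised visual odometry method that improves generalization through selective online adaptation.	visual odometry	✅
7	SRIF: Semantic Shape Registration Empowered by Diffusion-based Image Morphing and Flow Estimation	Proposes SRIF, which uses diffusion-based image morphing and flow estimation for semantic shape registration.	3D gaussian splatting, gaussian splatting, splatting	✅
8	Vista3D: Unravel the 3D Darkside of a Single Image	Vista3D: a fast and consistent single-image 3D generation framework that reveals the hidden 3D information of objects.	gaussian splatting, splatting	✅
9	Panoptic-Depth Forecasting	Proposes the Panoptic-Depth Forecasting task, which predicts panoptic segmentation and depth maps for future frames, improving robot navigation safety.	depth estimation
10	DAF-Net: A Dual-Branch Feature Decomposition Fusion Network with Domain Adaptive for Infrared and Visible Image Fusion	Proposes DAF-Net, which fuses infrared and visible images through dual-branch feature decomposition and domain adaptation.	scene understanding	✅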

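🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
11	ChefFusion: Multimodal Foundation Model Integrating Recipe and Food Image Generation	ChefFusion: a multimodal foundation model integrating recipe and food image generation.	large language model, foundation model, multimodal
12	Large Language Models are Strong Audio-Visual Speech Recognition Learners	Proposes Llama-AVSR, which uses a multimodal LLM to achieve strong speech and audio-visual speech recognition.	large language model, multimodal
13	Cross-Organ and Cross-Scanner Adenocarcinoma Segmentation using Rein to Fine-tune Vision Foundation Models	Applies Rein fine-tuning to efficiently adapt vision foundation models for cross-organ and cross-scanner adenocarcinoma segmentation.	foundation model
14	Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression	Proposes Free-VSC, which uses semantics from visual foundation models to enhance unsupervised video semantic compression.	foundation model
15	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	Qwen2-VL: enhances vision-language models' perception of the world through dynamic resolution.	multimodal	✅
16	Knowledge Adaptation Network for Few-Shot Class-Incremental Learning	Proposes the Knowledge Adaptation Network (KANet) to address representation bias in few-shot class-incremental learning.	foundation model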

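🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
17	StableMamba: Distillation-free Scaling of Large SSMs for Images and Videos	Proposes StableMamba, an architecture that scales large SSMs to image and video tasks without distillation.	Mamba, SSM, distillation
18	Multimodal Generalized Category Discovery	Proposes the MM-GCD framework, which addresses multimodal generalized category discovery by aligning feature and output spaces.	contrastive learning, distillation, multimodal
19	JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation	Proposes JEAN, a NeRF-based talking face generation method jointly guided by expression and audio.	contrastive learning, NeRF
20	PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba	PhysMamba: efficient remote physiological measurement from facial videos using a temporal difference Mamba.	Mamba, SSM, state space model	✅
21	DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information	DETECLAP: enhances audio-visual representation learning with object information, improving fine-grained recognition.	representation learning, masked autoencoder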

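🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
22	FAST GDRNPP: Improving the Speed of State-of-the-Art 6D Object Pose Estimation	Proposes FAST GDRNPP, which speeds up 6D object pose estimation while balancing accuracy and speed.	manipulation, distillation
23	Controllable Shape Modeling with Neural Generalized Cylinder	Proposes the Neural Generalized Cylinder (NGC) for controllable neural implicit shape modeling.	manipulation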

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
24	MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion	MoRAG: proposes a human motion generation method based on multi-part fusion retrieval-augmented generation.	motion diffusion model, motion diffusion, motion generation

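🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
25	LFIC-DRASC: Deep Light Field Image Compression Using Disentangled Representation and Asymmetrical Strip Convolution	Proposes LFIC-DRASC, which achieves efficient light field image compression using disentangled representation and asymmetrical strip convolution.	spatial relationship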

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
26	WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild	WiLoR: proposes an end-to-end framework for 3D hand localization and reconstruction in the wild.	hand reconstruction	✅
