cs.CV (2024-08-26)

📊 17 papers total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (5 🔗1) · Pillar 9: Embodied Foundation Models (4 🔗3) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 1: Robot Control (2 🔗1) · Pillar 8: Physics-based Animation (1 🔗1)

🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	ShapeMamba-EM: Fine-Tuning Foundation Model with Local Shape Descriptors and Mamba Blocks for 3D EM Image Segmentation	ShapeMamba-EM: fine-tunes a foundation model with local shape descriptors and Mamba blocks for 3D EM image segmentation.	Mamba foundation model
2	LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation	LoG-VMamba: a local-global vision Mamba model for medical image segmentation.	Mamba SSM state space model	✅
3	Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving	Drive-OccWorld: proposes a world model for vision-centric 4D occupancy forecasting and planning in autonomous driving.	world model spatiotemporal
4	Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities	Proposes an audio-visual speaker tracking method based on a global-local distillation network for robust tracking when modalities are missing.	teacher-student distillation
5	Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection	Proposes V2I-DETR, which uses video knowledge distillation to improve the efficiency and accuracy of medical video lesion detection.	teacher-student distillation

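🔬 Pillar 9: Embodied Foundation Models (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
6	A Practitioner's Guide to Continual Multimodal Pretraining	Proposes the FoMo-in-Flux benchmark to guide continual updating of multimodal pretrained models in real-world deployment.	foundation model multimodal	✅
7	MMR: Evaluating Reading Ability of Large Multimodal Models	Proposes the multimodal reading benchmark MMR to evaluate the reading comprehension of large multimodal models on text-rich images.	multimodal
8	An Embedding is Worth a Thousand Noisy Labels	Proposes WANN, which uses self-supervised features and reliability scores to handle noisy labels effectively.	foundation model	✅
9	Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos	Video-CCAM: uses causal cross-attention masks to enhance video-language understanding for both short and long videos.	large language model	✅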

🔬 Pillar 3: Perception & Semantics (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
10	DynaSurfGS: Dynamic Surface Reconstruction with Planar-based Gaussian Splatting	DynaSurfGS: a dynamic surface reconstruction method based on planar Gaussian splatting.	gaussian splatting splatting scene reconstruction
11	NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training	NimbleD: improves self-supervised monocular depth estimation with pseudo-labels and large-scale video pre-training.	depth estimation monocular depth	✅
12	Avatar Concept Slider: Controllable Editing of Concepts in 3D Human Avatars	Proposes Avatar Concept Slider for controllable editing of concepts in 3D human avatars.	3D gaussian splatting gaussian splatting splatting

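🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
13	Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos	Proposes the GeLM model to address temporal grounding and reasoning in multi-hop question answering over long-form egocentric videos.	egocentric large language model
14	MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement	MagicMan: novel view synthesis of humans using 3D-aware diffusion and iterative refinement.	SMPL SMPL-X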

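🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
15	Center Direction Network for Grasping Point Localization on Cloths	Proposes CeDiRNet-3DoF for grasp point localization on cloth; the method won the ICRA 2023 challenge.	manipulation	✅
16	Social Perception of Faces in a Vision-Language Model	Uses CLIP to study the social perception of faces, revealing model biases and the influence of attributes.	manipulation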

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
17	LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models	Proposes LMM-VQA, which uses large multimodal models to improve video quality assessment performance.	spatiotemporal large language model multimodal	✅
