cs.CV (2024-08-26)

📊 17 papers in total | 🔗 7 with code

🎯 Interest area navigation

Pillar 2: RL & Architecture (5 🔗1) · Pillar 9: Embodied Foundation Models (4 🔗3) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 1: Robot Control (2 🔗1) · Pillar 8: Physics-based Animation (1 🔗1)

🔬 Pillar 2: RL & Architecture (5 papers)

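| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | ShapeMamba-EM: Fine-Tuning Foundation Model with Local Shape Descriptors and Mamba Blocks for 3D EM Image Segmentation | ShapeMamba-EM: fine-tunes a foundation model with local shape descriptors and Mamba blocks for 3D EM image segmentation | Mamba, foundation model | |
| 2 | LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation | LoG-VMamba: a local-global vision Mamba model for medical image segmentation | Mamba, SSM, state space model | |
| 3 | Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving | Drive-OccWorld: a world model for vision-centric 4D occupancy forecasting and planning in autonomous driving | world model, spatiotemporal | |
| 4 | Global-Local Distillation Network-Based Audio-Visual Speaker Tracking with Incomplete Modalities | Proposes a global-local distillation network for audio-visual speaker tracking that stays robust when modalities are missing | teacher-student distillation | |
| 5 | Let Video Teaches You More: Video-to-Image Knowledge Distillation using DEtection TRansformer for Medical Video Lesion Detection | Proposes V2I-DETR, which uses video knowledge distillation to improve the efficiency and accuracy of medical video lesion detection | teacher-student distillation | |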

🔬 Pillar 9: Embodied Foundation Models (4 papers)

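| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 6 | A Practitioner's Guide to Continual Multimodal Pretraining | Proposes the FoMo-in-Flux benchmark, offering guidance on continually updating multimodal pretrained models in real-world deployment | foundation model, multimodal | |
| 7 | MMR: Evaluating Reading Ability of Large Multimodal Models | Proposes MMR, a multimodal reading benchmark for evaluating the reading comprehension of large multimodal models on text-rich images | multimodal | |
| 8 | An Embedding is Worth a Thousand Noisy Labels | Proposes WANN, which uses self-supervised features and a reliability score to handle noisy labels effectively | foundation model | |
| 9 | Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos | Video-CCAM: enhances video-language understanding with causal cross-attention masks, for both short and long videos | large language model | |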

🔬 Pillar 3: Perception & Semantics (3 papers)

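| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 10 | DynaSurfGS: Dynamic Surface Reconstruction with Planar-based Gaussian Splatting | DynaSurfGS: dynamic surface reconstruction based on planar Gaussian splatting | gaussian splatting, splatting, scene reconstruction | |
| 11 | NimbleD: Enhancing Self-supervised Monocular Depth Estimation with Pseudo-labels and Large-scale Video Pre-training | NimbleD: improves self-supervised monocular depth estimation with pseudo-labels and large-scale video pre-training | depth estimation, monocular depth | |
| 12 | Avatar Concept Slider: Controllable Editing of Concepts in 3D Human Avatars | Proposes Avatar Concept Slider for controllable editing of concepts in 3D human avatars | 3D gaussian splatting, gaussian splatting, splatting | |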

🔬 Pillar 6: Video Extraction (2 papers)

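| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos | Proposes the GeLM model to address temporal grounding and reasoning in multi-hop question answering over long egocentric videos | egocentric, large language model | |
| 14 | MagicMan: Generative Novel View Synthesis of Humans with 3D-Aware Diffusion and Iterative Refinement | MagicMan: novel view synthesis of humans using 3D-aware diffusion and iterative refinement | SMPL, SMPL-X | |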

🔬 Pillar 1: Robot Control (2 papers)

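| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 15 | Center Direction Network for Grasping Point Localization on Cloths | Proposes CeDiRNet-3DoF for grasping point localization on cloths; it won the ICRA 2023 challenge | manipulation | |
| 16 | Social Perception of Faces in a Vision-Language Model | Uses CLIP to study the social perception of faces, revealing model biases and the influence of attributes | manipulation | |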

🔬 Pillar 8: Physics-based Animation (1 paper)

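| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models | Proposes LMM-VQA, which uses large multimodal models to improve video quality assessment performance | spatiotemporal, large language model, multimodal | |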
