cs.CV (2024-12-05)

📊 52 papers in total | 🔗 14 with code

🎯 Interest-Area Navigation

Pillar 3: Spatial Perception & Semantics (18, 🔗 4) · Pillar 9: Embodied Foundation Models (15, 🔗 6) · Pillar 4: Generative Motion (5, 🔗 1) · Pillar 1: Robot Control (5, 🔗 1) · Pillar 2: RL & Architecture (4, 🔗 1) · Pillar 6: Video Extraction (4, 🔗 1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Spatial Perception & Semantics (18 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 1 | SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding | Proposes SeeGround to tackle zero-shot open-vocabulary 3D visual grounding. | open-vocabulary, open vocabulary, visual grounding |
| 2 | HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting | HybridGS decouples transient and static scene content with 2D and 3D Gaussian splatting for high-quality novel view synthesis. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 3 | Towards Real-Time Open-Vocabulary Video Instance Segmentation | Proposes TROY-VIS, accelerating open-vocabulary video instance segmentation to real-time speeds. | open-vocabulary, open vocabulary, foundation model |
| 4 | DGNS: Deformable Gaussian Splatting and Dynamic Neural Surface for Monocular Dynamic 3D Reconstruction | DGNS combines deformable Gaussian splatting with a dynamic neural surface for monocular dynamic 3D reconstruction. | gaussian splatting, splatting, scene reconstruction |
| 5 | PhysDepth: Plug-and-Play Physical Refinement for Monocular Depth Estimation in Challenging Environments | PhysDepth adds plug-and-play physics-based refinement to monocular depth estimation, improving robustness in challenging environments. | depth estimation, monocular depth |
| 6 | Monocular Dynamic Gaussian Splatting: Fast, Brittle, and Scene Complexity Rules | Finds monocular dynamic Gaussian splatting fast but brittle, with quality governed by scene complexity. | gaussian splatting, splatting |
| 7 | Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation | Mask-Adapter improves open-vocabulary segmentation by refining the masks. | open-vocabulary, open vocabulary |
| 8 | Grounding Descriptions in Images informs Zero-Shot Visual Recognition | GRAIN improves zero-shot visual recognition by aligning region-level image descriptions with images. | open-vocabulary, open vocabulary, large language model |
| 9 | PBDyG: Position Based Dynamic Gaussians for Motion-Aware Clothed Human Avatars | Proposes PBDyG, position-based dynamic Gaussians for reconstructing motion-aware clothed human avatars. | 3D gaussian splatting, gaussian splatting, splatting |
| 10 | EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding | EmbodiedOcc: an embodied 3D occupancy prediction framework for vision-based online scene understanding. | splatting, scene understanding |
| 11 | Deep Learning and Hybrid Approaches for Dynamic Scene Analysis, Object Detection and Motion Tracking | A deep-learning and hybrid system for dynamic scene analysis, object detection, and motion tracking, aimed at video surveillance. | optical flow, motion tracking |
| 12 | MT3DNet: Multi-Task learning Network for 3D Surgical Scene Reconstruction | MT3DNet: a multi-task learning network for 3D surgical scene reconstruction. | depth estimation, scene reconstruction |
| 13 | Multi-View Pose-Agnostic Change Localization with Zero Labels | A label-free, pose-agnostic multi-view change localization method built on 3D Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 14 | QUEEN: QUantized Efficient ENcoding of Dynamic Gaussians for Streaming Free-viewpoint Videos | Proposes the QUEEN framework for streaming free-viewpoint video online. | 3D gaussian splatting, gaussian splatting, splatting |
| 15 | Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail | Stereo Anywhere combines geometric constraints with monocular depth priors for robust zero-shot stereo matching. | monocular depth, foundation model |
| 16 | Turbo3D: Ultra-fast Text-to-3D Generation | Turbo3D: an ultra-fast text-to-3D Gaussian splatting system that generates high-quality assets in under one second. | gaussian splatting, splatting |
| 17 | MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos | MegaSaM: fast, accurate, and robust structure-and-motion recovery from casual dynamic videos. | visual SLAM, depth estimation |
| 18 | Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering | Real-time high-fidelity radiance field rendering via adaptive sparse-voxel rasterization. | gaussian splatting, splatting |

🔬 Pillar 9: Embodied Foundation Models (15 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 19 | FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | FlashSloth accelerates multimodal large language models via embedded visual compression. | large language model, multimodal |
| 20 | AIpparel: A Multimodal Foundation Model for Digital Garments | AIpparel: a multimodal foundation model for digital garments, enabling garment generation and editing. | foundation model, multimodal |
| 21 | MageBench: Bridging Large Multimodal Models to Agents | MageBench bridges large multimodal models and agents, benchmarking their visual reasoning. | multimodal, chain-of-thought |
| 22 | CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation | Proposes CreatiLayout, a Siamese multimodal diffusion Transformer for controllable layout-to-image generation. | large language model, multimodal |
| 23 | Reflective Teacher: Semi-Supervised Multimodal 3D Object Detection in Bird's-Eye-View via Uncertainty Measure | Proposes Reflective Teacher and GA-BEVFusion to improve semi-supervised 3D object detection in bird's-eye view. | multimodal |
| 24 | SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model | SIDA uses a large multimodal model to detect, localize, and explain deepfakes in social-media images. | multimodal |
| 25 | Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects | Quantifies the limits of segmentation foundation models by modeling the challenges of segmenting tree-like and low-contrast objects. | foundation model |
| 26 | PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models | PANGAEA: a global, inclusive benchmark for geospatial foundation models spanning diverse datasets and tasks. | foundation model |
| 27 | Cross-Self KV Cache Pruning for Efficient Vision-Language Inference | Proposes Cross-Self Pruning (CSP), a KV-cache pruning method for efficient vision-language model inference. | large language model, multimodal |
| 28 | Assessing and Learning Alignment of Unimodal Vision and Language Models | Proposes SAIL, efficiently aligning unimodal vision and language models to improve multimodal performance. | large language model, multimodal |
| 29 | p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay | p-MoD builds efficient mixture-of-depths multimodal LLMs via progressive ratio decay. | large language model, multimodal |
| 30 | Liquid: Language Models are Scalable and Unified Multi-modal Generators | Liquid: a scalable, unified multimodal generation model that improves both visual understanding and generation. | large language model, multimodal |
| 31 | VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction | VASCAR: content-aware layout generation via visual-aware self-correction. | large language model |
| 32 | MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models | MegaCOIN enhances medium-grained color perception in vision-language models. | multimodal |
| 33 | DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism | DiffSign: AI-assisted generation of customizable sign-language videos with enhanced realism. | multimodal |

🔬 Pillar 4: Generative Motion (5 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 34 | IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation | IF-MDM: an implicit face motion diffusion model for high-fidelity real-time talking-head generation. | motion diffusion model, MDM, motion diffusion |
| 35 | Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation | Mogo: an RQ hierarchical causal Transformer for high-quality 3D human motion generation. | text-to-motion, motion generation, VQ-VAE |
| 36 | RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse | RMD: a training-free retrieval-augmented motion diffusion baseline for more general human motion generation. | motion diffusion model, motion diffusion, motion generation |
| 37 | INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations | INFP: an audio-driven interactive head generation framework for dyadic conversations. | motion generation, motion latent, dyadic interaction |
| 38 | CRAFT: Designing Creative and Functional 3D Objects | CRAFT designs creative, functional, and ergonomic 3D objects. | penetration |

🔬 Pillar 1: Robot Control (5 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 39 | EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM | EditScout uses a multimodal LLM to locate forged regions in diffusion-edited images. | manipulation, large language model, multimodal |
| 40 | GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities | GigaHands: a massive annotated dataset of bimanual hand activities for AI and robotics. | bi-manual |
| 41 | UnZipLoRA: Separating Content and Style from a Single Image | UnZipLoRA: a LoRA method that disentangles content and style from a single image. | manipulation |
| 42 | DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction | DualPM: dual posed-canonical point maps for 3D shape and pose reconstruction. | quadruped |
| 43 | HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing | HumanEdit: a high-quality human-rewarded dataset for instruction-based image editing, improving editing precision and diversity. | manipulation |

🔬 Pillar 2: RL Algorithms & Architecture (4 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 44 | Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Florence-VL enhances vision-language models with a generative vision encoder and depth-breadth fusion. | contrastive learning, large language model, foundation model |
| 45 | Diffusion-Augmented Coreset Expansion for Scalable Dataset Distillation | Proposes diffusion-augmented coreset expansion for scalable dataset distillation. | distillation, foundation model |
| 46 | SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning | SoMA adapts a model's minor singular components via SVD to improve domain-generalizable representation learning. | representation learning, foundation model |
| 47 | Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | Divot: a diffusion-powered video tokenizer enabling both video comprehension and generation. | representation learning, large language model |

🔬 Pillar 6: Video Extraction & Matching (4 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 48 | HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery | HeatFormer: a neural optimization method for multiview human mesh recovery. | human mesh recovery, SMPL |
| 49 | EgoPoints: Advancing Point Tracking for Egocentric Videos | Proposes EgoPoints to tackle point tracking in egocentric videos. | egocentric |
| 50 | Cubify Anything: Scaling Indoor 3D Object Detection | Proposes the CA-1M dataset and CuTR model, scaling indoor 3D object detection in both data and accuracy. | egocentric |
| 51 | HANDI: Hand-Centric Text-and-Image Conditioned Video Generation | HANDI: hand-centric text-and-image conditioned video generation with improved motion detail. | Ego4D |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 52 | D-LORD for Motion Stylization | D-LORD disentangles style and content in motion sequences for motion stylization and retargeting. | motion retargeting, latent optimization |
