cs.CV (2024-12-05)

📊 52 papers in total | 🔗 14 with code

🎯 Interest-Area Navigation

Pillar 3: Spatial Perception & Semantics (18, 🔗 4) · Pillar 9: Embodied Foundation Models (15, 🔗 6) · Pillar 4: Generative Motion (5, 🔗 1) · Pillar 1: Robot Control (5, 🔗 1) · Pillar 2: RL & Architecture (4, 🔗 1) · Pillar 6: Video Extraction (4, 🔗 1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Spatial Perception & Semantics (18 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 1 | SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding | Proposes SeeGround to tackle zero-shot open-vocabulary 3D visual grounding. | open-vocabulary, open vocabulary, visual grounding |
| 2 | HybridGS: Decoupling Transients and Statics with 2D and 3D Gaussian Splatting | HybridGS decouples transient and static scene content with 2D and 3D Gaussian splatting for high-quality novel view synthesis. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 3 | Towards Real-Time Open-Vocabulary Video Instance Segmentation | Proposes TROY-VIS, accelerating open-vocabulary video instance segmentation to real-time speeds. | open-vocabulary, open vocabulary, foundation model |
| 4 | DGNS: Deformable Gaussian Splatting and Dynamic Neural Surface for Monocular Dynamic 3D Reconstruction | DGNS combines deformable Gaussian splatting with a dynamic neural surface for monocular dynamic 3D reconstruction. | gaussian splatting, splatting, scene reconstruction |
| 5 | PhysDepth: Plug-and-Play Physical Refinement for Monocular Depth Estimation in Challenging Environments | PhysDepth adds plug-and-play physics-based refinement to monocular depth estimation, improving robustness in challenging environments. | depth estimation, monocular depth |
| 6 | Monocular Dynamic Gaussian Splatting: Fast, Brittle, and Scene Complexity Rules | Finds monocular dynamic Gaussian splatting fast but brittle, with quality governed by scene complexity. | gaussian splatting, splatting |
| 7 | Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation | Mask-Adapter improves open-vocabulary segmentation by refining the masks. | open-vocabulary, open vocabulary |
| 8 | Grounding Descriptions in Images informs Zero-Shot Visual Recognition | GRAIN improves zero-shot visual recognition by aligning region-level image descriptions with images. | open-vocabulary, open vocabulary, large language model |
| 9 | PBDyG: Position Based Dynamic Gaussians for Motion-Aware Clothed Human Avatars | Proposes PBDyG, position-based dynamic Gaussians for reconstructing motion-aware clothed human avatars. | 3D gaussian splatting, gaussian splatting, splatting |
| 10 | EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding | EmbodiedOcc: an embodied 3D occupancy prediction framework for vision-based online scene understanding. | splatting, scene understanding |
| 11 | Deep Learning and Hybrid Approaches for Dynamic Scene Analysis, Object Detection and Motion Tracking | A deep-learning and hybrid system for dynamic scene analysis, object detection, and motion tracking, aimed at video surveillance. | optical flow, motion tracking |
| 12 | MT3DNet: Multi-Task learning Network for 3D Surgical Scene Reconstruction | MT3DNet: a multi-task learning network for 3D surgical scene reconstruction. | depth estimation, scene reconstruction |
| 13 | Multi-View Pose-Agnostic Change Localization with Zero Labels | A label-free, pose-agnostic multi-view change localization method built on 3D Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 14 | QUEEN: QUantized Efficient ENcoding of Dynamic Gaussians for Streaming Free-viewpoint Videos | Proposes the QUEEN framework for streaming free-viewpoint video online. | 3D gaussian splatting, gaussian splatting, splatting |
| 15 | Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail | Stereo Anywhere combines geometric constraints with monocular depth priors for robust zero-shot stereo matching. | monocular depth, foundation model |
| 16 | Turbo3D: Ultra-fast Text-to-3D Generation | Turbo3D: an ultra-fast text-to-3D Gaussian splatting system that generates high-quality assets in under one second. | gaussian splatting, splatting |
| 17 | MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos | MegaSaM: fast, accurate, and robust structure-and-motion recovery from casual dynamic videos. | visual SLAM, depth estimation |
| 18 | Sparse Voxels Rasterization: Real-time High-fidelity Radiance Field Rendering | Real-time high-fidelity radiance field rendering via adaptive sparse-voxel rasterization. | gaussian splatting, splatting |

🔬 Pillar 9: Embodied Foundation Models (15 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 19 | FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression | FlashSloth accelerates multimodal large language models via embedded visual compression. | large language model, multimodal |
| 20 | AIpparel: A Multimodal Foundation Model for Digital Garments | AIpparel: a multimodal foundation model for digital garments, enabling garment generation and editing. | foundation model, multimodal |
| 21 | MageBench: Bridging Large Multimodal Models to Agents | MageBench bridges large multimodal models and agents, benchmarking their visual reasoning. | multimodal, chain-of-thought |
| 22 | CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation | Proposes CreatiLayout, a Siamese multimodal diffusion Transformer for controllable layout-to-image generation. | large language model, multimodal |
| 23 | Reflective Teacher: Semi-Supervised Multimodal 3D Object Detection in Bird's-Eye-View via Uncertainty Measure | Proposes Reflective Teacher and GA-BEVFusion to improve semi-supervised 3D object detection in bird's-eye view. | multimodal |
| 24 | SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model | SIDA uses a large multimodal model to detect, localize, and explain deepfakes in social-media images. | multimodal |
| 25 | Quantifying the Limits of Segmentation Foundation Models: Modeling Challenges in Segmenting Tree-Like and Low-Contrast Objects | Quantifies the limits of segmentation foundation models by modeling the challenges of segmenting tree-like and low-contrast objects. | foundation model |
| 26 | PANGAEA: A Global and Inclusive Benchmark for Geospatial Foundation Models | PANGAEA: a global, inclusive benchmark for geospatial foundation models spanning diverse datasets and tasks. | foundation model |
| 27 | Cross-Self KV Cache Pruning for Efficient Vision-Language Inference | Proposes Cross-Self Pruning (CSP), a KV-cache pruning method for efficient vision-language model inference. | large language model, multimodal |
| 28 | Assessing and Learning Alignment of Unimodal Vision and Language Models | Proposes SAIL, efficiently aligning unimodal vision and language models to improve multimodal performance. | large language model, multimodal |
| 29 | p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay | p-MoD builds efficient mixture-of-depths multimodal LLMs via progressive ratio decay. | large language model, multimodal |
| 30 | Liquid: Language Models are Scalable and Unified Multi-modal Generators | Liquid: a scalable, unified multimodal generation model that improves both visual understanding and generation. | large language model, multimodal |
| 31 | VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction | VASCAR: content-aware layout generation via visual-aware self-correction. | large language model |
| 32 | MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models | MegaCOIN enhances medium-grained color perception in vision-language models. | multimodal |
| 33 | DiffSign: AI-Assisted Generation of Customizable Sign Language Videos With Enhanced Realism | DiffSign: AI-assisted generation of customizable sign-language videos with enhanced realism. | multimodal |

🔬 Pillar 4: Generative Motion (5 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 34 | IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation | IF-MDM: an implicit face motion diffusion model for high-fidelity real-time talking-head generation. | motion diffusion model, MDM, motion diffusion |
| 35 | Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation | Mogo: an RQ hierarchical causal Transformer for high-quality 3D human motion generation. | text-to-motion, motion generation, VQ-VAE |
| 36 | RMD: A Simple Baseline for More General Human Motion Generation via Training-free Retrieval-Augmented Motion Diffuse | RMD: a training-free retrieval-augmented motion diffusion baseline for more general human motion generation. | motion diffusion model, motion diffusion, motion generation |
| 37 | INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations | INFP: an audio-driven interactive head generation framework for dyadic conversations. | motion generation, motion latent, dyadic interaction |
| 38 | CRAFT: Designing Creative and Functional 3D Objects | CRAFT designs creative, functional, and ergonomic 3D objects. | penetration |

🔬 Pillar 1: Robot Control (5 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 39 | EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM | EditScout uses a multimodal LLM to locate forged regions in diffusion-edited images. | manipulation, large language model, multimodal |
| 40 | GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities | GigaHands: a massive annotated dataset of bimanual hand activities for AI and robotics. | bi-manual |
| 41 | UnZipLoRA: Separating Content and Style from a Single Image | UnZipLoRA: a LoRA method that disentangles content and style from a single image. | manipulation |
| 42 | DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose Reconstruction | DualPM: dual posed-canonical point maps for 3D shape and pose reconstruction. | quadruped |
| 43 | HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing | HumanEdit: a high-quality human-rewarded dataset for instruction-based image editing, improving editing precision and diversity. | manipulation |

🔬 Pillar 2: RL Algorithms & Architecture (4 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 44 | Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion | Florence-VL enhances vision-language models with a generative vision encoder and depth-breadth fusion. | contrastive learning, large language model, foundation model |
| 45 | Diffusion-Augmented Coreset Expansion for Scalable Dataset Distillation | Proposes diffusion-augmented coreset expansion for scalable dataset distillation. | distillation, foundation model |
| 46 | SoMA: Singular Value Decomposed Minor Components Adaptation for Domain Generalizable Representation Learning | SoMA adapts a model's minor singular components via SVD to improve domain-generalizable representation learning. | representation learning, foundation model |
| 47 | Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation | Divot: a diffusion-powered video tokenizer enabling both video comprehension and generation. | representation learning, large language model |

🔬 Pillar 6: Video Extraction & Matching (4 papers)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 48 | HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery | HeatFormer: a neural optimization method for multiview human mesh recovery. | human mesh recovery, SMPL |
| 49 | EgoPoints: Advancing Point Tracking for Egocentric Videos | Proposes EgoPoints to tackle point tracking in egocentric videos. | egocentric |
| 50 | Cubify Anything: Scaling Indoor 3D Object Detection | Proposes the CA-1M dataset and CuTR model, scaling indoor 3D object detection in both data and accuracy. | egocentric |
| 51 | HANDI: Hand-Centric Text-and-Image Conditioned Video Generation | HANDI: hand-centric text-and-image conditioned video generation with improved motion detail. | Ego4D |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-line summary | Tags |
| --- | --- | --- | --- |
| 52 | D-LORD for Motion Stylization | D-LORD disentangles style and content in motion sequences for motion stylization and retargeting. | motion retargeting, latent optimization |
