cs.CV (2024-11-18)

📊 33 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception and Semantics (12 🔗5) · Pillar 2: RL Algorithms and Architecture (9 🔗3) · Pillar 9: Embodied Foundation Models (7 🔗4) · Pillar 6: Video Extraction and Matching (3) · Pillar 1: Robot Control (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Spatial Perception and Semantics (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views	Proposes GPS-Gaussian+, a generalizable pixel-wise 3D Gaussian splatting method for real-time rendering of humans and scenes from sparse views.	depth estimation 3D gaussian splatting gaussian splatting
2	Towards Open-Vocabulary Audio-Visual Event Localization	Proposes the OV-AVEL task and the OV-AVEBench dataset for open-vocabulary audio-visual event localization.	open-vocabulary open vocabulary multimodal
3	Scalable Autoregressive Monocular Depth Estimation	Proposes DAR, a scalable autoregressive monocular depth estimation model that significantly improves depth estimation accuracy.	depth estimation monocular depth Depth Anything
4	TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction	Proposes TimeFormer, which uses a temporal Transformer to model motion relationships in dynamic 3D Gaussian reconstruction.	3D gaussian splatting gaussian splatting splatting	✅
5	UniHands: Unifying Various Wild-Collected Keypoints for Personalized Hand Reconstruction	UniHands: unifies diverse wild-collected keypoints for personalized hand reconstruction.	implicit representation MANO hand reconstruction
6	MGNiceNet: Unified Monocular Geometric Scene Understanding	MGNiceNet: a unified monocular geometric scene understanding framework for autonomous driving.	depth estimation monocular depth scene understanding	✅
7	DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes	DeSiRe-GS: a 4D street Gaussian model for static-dynamic decomposition and surface reconstruction of urban driving scenes.	3D gaussian splatting gaussian splatting splatting	✅
8	Towards Degradation-Robust Reconstruction in Generalizable NeRF	Proposes the Objaverse Blur dataset and a 3D-aware feature module to make GNeRF reconstruction more robust to blur degradation.	NeRF neural radiance field
9	The ADUULM-360 Dataset -- A Multi-Modal Dataset for Depth Estimation in Adverse Weather	Proposes ADUULM-360, a multi-modal dataset for depth estimation research in adverse weather.	depth estimation scene understanding	✅
10	LeC$^2$O-NeRF: Learning Continuous and Compact Large-Scale Occupancy for Urban Scenes	LeC$^2$O-NeRF: learns a continuous, compact large-scale scene occupancy to speed up NeRF training on urban scenes.	NeRF occupancy grid
11	ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements	ITACLIP: improves training-free semantic segmentation through image, text, and architectural enhancements.	open-vocabulary open vocabulary large language model	✅
12	Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications	Underwater scene understanding: a survey of datasets, techniques, and applications for reducing label dependency.	scene understanding

🔬 Pillar 2: RL Algorithms and Architecture (9 papers)

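#	Title	One-line summary	Tags	🔗	⭐
13	FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training	FLAME: uses frozen large language models for data-efficient language-image pre-training.	distillation large language model	✅
14	RAWMamba: Unified sRGB-to-RAW De-rendering With State Space Model	Proposes RAWMamba for unified sRGB-to-RAW de-rendering of images and videos.	Mamba state space model
15	Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning for Imbalanced Multiclassification of Whole Slide Image	Proposes cross-patient pseudo-bag generation and curriculum contrastive learning for imbalanced multi-class classification of whole slide images (WSI).	representation learning contrastive learning
16	Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition	Proposes RCMSTR, which combines relational contrastive learning and masked image modeling to improve scene text recognition.	representation learning contrastive learning	✅
17	Distill the Best, Ignore the Rest: Improving Dataset Distillation with Loss-Value-Based Pruning	Proposes a dataset distillation method based on loss-value pruning that improves generalization and distillation quality.	distillation
18	SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input	SpatialDreamer: a self-supervised stereo video synthesis method that generates stereo video from monocular video.	dreamer
19	Color-Oriented Redundancy Reduction in Dataset Distillation	Proposes the AutoPalette framework, which improves dataset distillation through color-oriented redundancy reduction.	distillation	✅
20	Latent Knowledge-Guided Video Diffusion for Scientific Phenomena Generation from a Single Initial Frame	Proposes a latent-knowledge-guided video diffusion model for generating videos of scientific phenomena from a single frame.	masked autoencoder optical flow
21	In-Situ Melt Pool Characterization via Thermal Imaging for Defect Detection in Directed Energy Deposition Using Vision Transformers	Uses Vision Transformers and thermal imaging to characterize the melt pool in situ for defect detection in directed energy deposition.	masked autoencoder MAE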

🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	MAIRA-Seg: Enhancing Radiology Report Generation with Segmentation-Aware Multimodal Large Language Models	MAIRA-Seg: uses segmentation-aware multimodal large language models to improve radiology report generation.	large language model multimodal
23	AtomThink: Multimodal Slow Thinking with Atomic Step Reasoning	AtomThink: multimodal slow thinking through atomic step reasoning, improving performance on complex reasoning tasks.	large language model multimodal chain-of-thought	✅
24	Efficient Transfer Learning for Video-language Foundation Models	Proposes a multimodal spatiotemporal adapter for transfer learning of video-language models.	foundation model zero-shot transfer
25	The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning	Proposes MosAIC, a multi-agent framework that uses LMMs to improve cultural image captioning.	multimodal	✅
26	CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset	CCExpert: improves MLLM capability in remote sensing change captioning through difference-aware integration and a foundational dataset.	large language model multimodal	✅
27	Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning	Proposes a disentangling framework guided by MLLM embeddings and attribute smoothing to improve compositional zero-shot learning.	large language model multimodal	✅
28	PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment	Proposes PSA-VLM, which improves vision-language model safety through concept-bottleneck alignment.	large language model

🔬 Pillar 6: Video Extraction and Matching (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
29	SignEye: Traffic Sign Interpretation from Vehicle First-Person View	Proposes SignEye for traffic sign interpretation and traffic guidance assistance from the vehicle's first-person view.	egocentric first-person view
30	DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery	DeforHMR: a Transformer with deformable cross-attention for 3D human mesh recovery.	human mesh recovery HMR
31	Generative World Explorer	Proposes Generative World Explorer for mental exploration and decision-making by embodied agents in 3D urban scenes.	egocentric embodied AI

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
32	FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting	FruitNinja: generates interior textures of 3D objects with Gaussian splatting, enabling real-time slicing and rendering.	manipulation 3D gaussian splatting 3DGS

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
33	LaVin-DiT: Large Vision Diffusion Transformer	Proposes LaVin-DiT, a scalable, unified vision diffusion Transformer foundation model for a variety of vision tasks.	spatial relationship foundation model
