cs.CV (2024-11-18)

📊 33 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception & Semantics (12 🔗5) · Pillar 2: RL Algorithms & Architecture (9 🔗3) · Pillar 9: Embodied Foundation Models (7 🔗4) · Pillar 6: Video Extraction & Matching (3) · Pillar 1: Robot Control (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Spatial Perception & Semantics (12 papers)

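| # | Title | Key point | Tags | 🔗 |
| 1 | GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views | Proposes GPS-Gaussian+, a generalizable pixel-wise 3D Gaussian splatting method for real-time rendering of humans and scenes from sparse views. | depth estimation, 3D gaussian splatting, gaussian splatting | |
| 2 | Towards Open-Vocabulary Audio-Visual Event Localization | Proposes the OV-AVEL task and the OV-AVEBench dataset for open-vocabulary audio-visual event localization. | open-vocabulary, open vocabulary, multimodal | |
| 3 | Scalable Autoregressive Monocular Depth Estimation | Proposes DAR, a scalable autoregressive monocular depth estimation model that markedly improves depth estimation accuracy. | depth estimation, monocular depth, Depth Anything | |
| 4 | TimeFormer: Capturing Temporal Relationships of Deformable 3D Gaussians for Robust Reconstruction | Proposes TimeFormer, which uses a temporal Transformer to model motion relationships in dynamic 3D Gaussian reconstruction. | 3D gaussian splatting, gaussian splatting, splatting | |
| 5 | UniHands: Unifying Various Wild-Collected Keypoints for Personalized Hand Reconstruction | UniHands: unifies various wild-collected keypoints for personalized hand reconstruction. | implicit representation, MANO, hand reconstruction | |
| 6 | MGNiceNet: Unified Monocular Geometric Scene Understanding | MGNiceNet: a unified monocular geometric scene understanding framework for autonomous driving. | depth estimation, monocular depth, scene understanding | |
| 7 | DeSiRe-GS: 4D Street Gaussians for Static-Dynamic Decomposition and Surface Reconstruction for Urban Driving Scenes | DeSiRe-GS: a 4D street Gaussian model for static-dynamic decomposition and surface reconstruction of urban driving scenes. | 3D gaussian splatting, gaussian splatting, splatting | |
| 8 | Towards Degradation-Robust Reconstruction in Generalizable NeRF | Proposes the Objaverse Blur dataset and a 3D-aware feature module to make GNeRF reconstruction more robust to blur degradation. | NeRF, neural radiance field | |
| 9 | The ADUULM-360 Dataset -- A Multi-Modal Dataset for Depth Estimation in Adverse Weather | Presents ADUULM-360, a multi-modal dataset for depth estimation research in adverse weather. | depth estimation, scene understanding | |
| 10 | LeC$^2$O-NeRF: Learning Continuous and Compact Large-Scale Occupancy for Urban Scenes | LeC$^2$O-NeRF: learns continuous, compact large-scale scene occupancy to speed up NeRF training on urban scenes. | NeRF, occupancy grid | |
| 11 | ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements | ITACLIP: improves training-free semantic segmentation through image, text, and architectural enhancements. | open-vocabulary, open vocabulary, large language model | |
| 12 | Reducing Label Dependency for Underwater Scene Understanding: A Survey of Datasets, Techniques and Applications | Underwater scene understanding: a survey of datasets, techniques, and applications for reducing label dependency. | scene understanding | |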

🔬 Pillar 2: RL Algorithms & Architecture (9 papers)

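| # | Title | Key point | Tags | 🔗 |
| 13 | FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training | FLAME: uses frozen large language models for data-efficient language-image pre-training. | distillation, large language model | |
| 14 | RAWMamba: Unified sRGB-to-RAW De-rendering With State Space Model | Proposes RAWMamba for unified sRGB-to-RAW de-rendering of images and videos. | Mamba, state space model | |
| 15 | Cross-Patient Pseudo Bags Generation and Curriculum Contrastive Learning for Imbalanced Multiclassification of Whole Slide Image | Proposes cross-patient pseudo-bag generation and curriculum contrastive learning for imbalanced multiclass classification of whole slide images (WSI). | representation learning, contrastive learning | |
| 16 | Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition | Proposes RCMSTR, which combines relational contrastive learning with masked image modeling to improve scene text recognition. | representation learning, contrastive learning | |
| 17 | Distill the Best, Ignore the Rest: Improving Dataset Distillation with Loss-Value-Based Pruning | Proposes a dataset distillation method based on loss-value pruning that improves generalization and distillation quality. | distillation | |
| 18 | SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input | SpatialDreamer: a self-supervised stereo video synthesis method that generates stereo video from monocular video. | dreamer | |
| 19 | Color-Oriented Redundancy Reduction in Dataset Distillation | Proposes the AutoPalette framework, which improves dataset distillation through color-oriented redundancy reduction. | distillation | |
| 20 | Latent Knowledge-Guided Video Diffusion for Scientific Phenomena Generation from a Single Initial Frame | Proposes a latent knowledge-guided video diffusion model for generating videos of scientific phenomena from a single frame. | masked autoencoder, optical flow | |
| 21 | In-Situ Melt Pool Characterization via Thermal Imaging for Defect Detection in Directed Energy Deposition Using Vision Transformers | Uses vision Transformers and thermal imaging to characterize the melt pool in situ for defect detection in directed energy deposition. | masked autoencoder, MAE | |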

🔬 Pillar 9: Embodied Foundation Models (7 papers)

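| # | Title | Key point | Tags | 🔗 |
| 22 | MAIRA-Seg: Enhancing Radiology Report Generation with Segmentation-Aware Multimodal Large Language Models | MAIRA-Seg: improves radiology report generation with segmentation-aware multimodal large language models. | large language model, multimodal | |
| 23 | AtomThink: Multimodal Slow Thinking with Atomic Step Reasoning | AtomThink: achieves multimodal slow thinking through atomic step reasoning, improving performance on complex reasoning tasks. | large language model, multimodal, chain-of-thought | |
| 24 | Efficient Transfer Learning for Video-language Foundation Models | Proposes a multimodal spatio-temporal adapter for transfer learning in video-language models. | foundation model, zero-shot transfer | |
| 25 | The Power of Many: Multi-Agent Multimodal Models for Cultural Image Captioning | Proposes MosAIC, a multi-agent framework that uses LMMs to improve cultural image captioning. | multimodal | |
| 26 | CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset | CCExpert: advances MLLM capability in remote sensing change captioning through difference-aware integration and a foundational dataset. | large language model, multimodal | |
| 27 | Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning | Proposes a disentanglement framework guided by MLLM embeddings and attribute smoothing to improve compositional zero-shot learning. | large language model, multimodal | |
| 28 | PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment | Proposes PSA-VLM, which strengthens vision-language model safety through concept-bottleneck alignment. | large language model | |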

🔬 Pillar 6: Video Extraction & Matching (3 papers)

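| # | Title | Key point | Tags | 🔗 |
| 29 | SignEye: Traffic Sign Interpretation from Vehicle First-Person View | Proposes SignEye for traffic sign interpretation and traffic guidance assistance from a vehicle's first-person view. | egocentric, first-person view | |
| 30 | DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery | DeforHMR: 3D human mesh recovery with a deformable cross-attention Transformer. | human mesh recovery, HMR | |
| 31 | Generative World Explorer | Proposes Generative World Explorer for mental exploration and decision-making by embodied agents in 3D urban scenes. | egocentric, embodied AI | |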

🔬 Pillar 1: Robot Control (1 paper)

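| # | Title | Key point | Tags | 🔗 |
| 32 | FruitNinja: 3D Object Interior Texture Generation with Gaussian Splatting | FruitNinja: generates interior textures of 3D objects with Gaussian splatting, enabling real-time slicing and rendering. | manipulation, 3D gaussian splatting, 3DGS | |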

🔬 Pillar 7: Motion Retargeting (1 paper)

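| # | Title | Key point | Tags | 🔗 |
| 33 | LaVin-DiT: Large Vision Diffusion Transformer | Proposes LaVin-DiT, a scalable, unified vision diffusion Transformer foundation model for a variety of vision tasks. | spatial relationship, foundation model | |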
