cs.CV (2025-04-04)

📊 23 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 2: RL & Architecture (6 🔗3) · Pillar 3: Perception & Semantics (4 🔗2) · Pillar 6: Video Extraction (2) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

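# | Title | One-line summary | Tags | 🔗
1 | MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models | MME-Unify: a comprehensive evaluation benchmark for unified multimodal understanding and generation models. | multimodal
2 | Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal | Proposes DB-CR, a multimodal diffusion bridge method with attention-based SAR fusion for cloud removal in satellite imagery. | multimodal
3 | RANa: Retrieval-Augmented Navigation | Proposes RANa, a retrieval-augmented navigation method that uses past experience to improve robot navigation performance. | foundation model, zero-shot transfer
4 | VISTA-OCR: Towards generative and interactive end to end OCR models | Proposes VISTA-OCR, a generative, interactive end-to-end OCR model that unifies text detection and recognition. | large language model, multimodal
5 | VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models | VideoComp: improves fine-grained compositional and temporal alignment in video-text models. | multimodal
6 | Can ChatGPT Learn My Life From a Week of First-Person Video? | Explores how well ChatGPT can learn information about a person's life from first-person video. | foundation model
7 | ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use | ScreenSpot-Pro: a GUI grounding benchmark for professional high-resolution computer use, together with the ScreenSeekeR method. | large language model
8 | Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models | Studies the robustness of uncertainty estimation in vision-language models on corrupted images. | large language model
9 | TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference | TokenFLEX: a unified VLM training framework that enables inference with a flexible number of visual tokens. | large language model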

🔬 Pillar 2: RL & Architecture (6 papers)

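# | Title | One-line summary | Tags | 🔗
10 | HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration | Proposes HumanDreamer-X to address geometric inconsistency in single-image human reconstruction. | dreamer, 3D gaussian splatting, 3DGS
11 | Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation | Proposes MFuser, which uses Mamba to fuse vision and vision-language models, improving domain-generalized semantic segmentation. | Mamba, foundation model
12 | RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation | Proposes RingMoE, a multimodal mixture-of-experts model for universal remote sensing image interpretation. | representation learning, depth estimation, foundation model
13 | LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders | Proposes LV-MAE, which learns long-video representations with masked-embedding autoencoders to improve long-video understanding. | MAE, spatiotemporal, multimodal
14 | Pyramid-based Mamba Multi-class Unsupervised Anomaly Detection | Proposes a pyramid-based Mamba method for multi-class unsupervised anomaly detection that improves localization of small anomalies. | Mamba, SSM, state space model
15 | Joint Retrieval of Cloud properties using Attention-based Deep Learning Models | Proposes CloudUNet, an attention-based model for joint retrieval of cloud optical thickness and effective radius. | MAE, spatial relationship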

🔬 Pillar 3: Perception & Semantics (4 papers)

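# | Title | One-line summary | Tags | 🔗
16 | Scaling Open-Vocabulary Action Detection | Proposes a scalable open-vocabulary action detection method that addresses existing methods' reliance on large-scale datasets and large parameter counts. | open-vocabulary, open vocabulary, multimodal
17 | WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments | WildGS-SLAM: monocular Gaussian splatting SLAM for robust mapping in dynamic environments. | gaussian splatting, splatting
18 | SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding | Proposes SARLANG-1M, a benchmark for vision-language modeling in SAR image understanding. | open-vocabulary, open vocabulary, penetration
19 | FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement | FaR: enhances multi-concept text-to-image diffusion models through concept fusion and localized refinement. | concept fusion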

🔬 Pillar 6: Video Extraction (2 papers)

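# | Title | One-line summary | Tags | 🔗
20 | ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition | ProbRes: open-world egocentric activity recognition based on probabilistic jump diffusion. | egocentric
21 | Robust Human Registration with Body Part Segmentation on Noisy Point Clouds | Proposes a robust human registration method that incorporates body part segmentation, improving pose estimation and segmentation accuracy on noisy point clouds. | SMPL, SMPL-X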

🔬 Pillar 1: Robot Control (1 paper)

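# | Title | One-line summary | Tags | 🔗
22 | 3D Scene Understanding Through Local Random Access Sequence Modeling | Proposes Local Random Access Sequence modeling (LRAS) to improve consistency and editing capability in single-image 3D scene understanding. | manipulation, depth estimation, scene understanding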

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
23 | Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions | Proposes Shape My Moves, which addresses text-driven, shape-aware human motion generation. | text-to-motion, motion synthesis, motion generation
