cs.CV (2025-04-04)

📊 23 papers in total | 🔗 7 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 2: RL & Architecture (6 🔗3) · Pillar 3: Perception & Semantics (4 🔗2) · Pillar 6: Video Extraction (2) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

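#	Title	One-sentence summary	Tags	🔗	⭐
1	MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models	MME-Unify: a comprehensive benchmark for evaluating unified multimodal understanding and generation models.	multimodal	✅
2	Multimodal Diffusion Bridge with Attention-Based SAR Fusion for Satellite Image Cloud Removal	Proposes DB-CR, a multimodal diffusion bridge method for satellite image cloud removal with attention-based SAR fusion.	multimodal
3	RANa: Retrieval-Augmented Navigation	Proposes RANa, a retrieval-augmented navigation method that uses past experience to improve robot navigation performance.	foundation model zero-shot transfer
4	VISTA-OCR: Towards generative and interactive end to end OCR models	Proposes VISTA-OCR, a generative, interactive end-to-end OCR model that unifies text detection and recognition.	large language model multimodal
5	VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models	VideoComp: improves fine-grained compositional and temporal alignment in video-text models.	multimodal
6	Can ChatGPT Learn My Life From a Week of First-Person Video?	Uses first-person video to explore how well ChatGPT can learn information about a person's life.	foundation model
7	ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use	ScreenSpot-Pro: a GUI grounding benchmark for professional high-resolution computer use, along with the ScreenSeekeR method.	large language model	✅
8	Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models	Studies the robustness of uncertainty estimation in vision-language models under image corruption.	large language model
9	TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference	TokenFLEX: proposes a unified VLM training framework that enables inference with a flexible number of visual tokens.	large language model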

🔬 Pillar 2: RL & Architecture (6 papers)

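#	Title	One-sentence summary	Tags	🔗	⭐
10	HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration	Proposes HumanDreamer-X to address geometric inconsistency in single-image human reconstruction.	dreamer 3D gaussian splatting 3DGS
11	Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation	Proposes MFuser, which uses Mamba to fuse vision and vision-language models and improves domain-generalized semantic segmentation performance.	Mamba foundation model	✅
12	RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation	Proposes RingMoE, a multimodal mixture-of-experts model for universal remote sensing image interpretation.	representation learning depth estimation foundation model
13	LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders	Proposes LV-MAE, which learns long-video representations with masked-embedding autoencoders to improve long-video understanding.	MAE spatiotemporal multimodal	✅
14	Pyramid-based Mamba Multi-class Unsupervised Anomaly Detection	Proposes a pyramid-based Mamba method for multi-class unsupervised anomaly detection that improves localization accuracy for small anomalies.	Mamba SSM state space model	✅
15	Joint Retrieval of Cloud properties using Attention-based Deep Learning Models	Proposes CloudUNet, an attention-based model for joint retrieval of cloud optical thickness and effective radius.	MAE spatial relationship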

🔬 Pillar 3: Perception & Semantics (4 papers)

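#	Title	One-sentence summary	Tags	🔗	⭐
16	Scaling Open-Vocabulary Action Detection	Proposes a scalable open-vocabulary action detection method that addresses existing methods' reliance on large-scale datasets and large parameter counts.	open-vocabulary open vocabulary multimodal	✅
17	WildGS-SLAM: Monocular Gaussian Splatting SLAM in Dynamic Environments	WildGS-SLAM: monocular Gaussian splatting SLAM for robust mapping in dynamic environments.	gaussian splatting splatting
18	SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding	Proposes SARLANG-1M, a benchmark for vision-language modeling in SAR image understanding.	open-vocabulary open vocabulary penetration	✅
19	FaR: Enhancing Multi-Concept Text-to-Image Diffusion via Concept Fusion and Localized Refinement	FaR: enhances multi-concept text-to-image diffusion models through concept fusion and localized refinement.	concept fusion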

🔬 Pillar 6: Video Extraction (2 papers)

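#	Title	One-sentence summary	Tags	🔗	⭐
20	ProbRes: Probabilistic Jump Diffusion for Open-World Egocentric Activity Recognition	ProbRes: open-world egocentric activity recognition based on probabilistic jump diffusion.	egocentric
21	Robust Human Registration with Body Part Segmentation on Noisy Point Clouds	Proposes a robust human registration method that incorporates body part segmentation, improving pose estimation and segmentation accuracy on noisy point clouds.	SMPL SMPL-X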

🔬 Pillar 1: Robot Control (1 paper)

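#	Title	One-sentence summary	Tags	🔗	⭐
22	3D Scene Understanding Through Local Random Access Sequence Modeling	Proposes Local Random Access Sequence (LRAS) modeling to improve consistency and editing capability in single-image 3D scene understanding.	manipulation depth estimation scene understanding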

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
23	Shape My Moves: Text-Driven Shape-Aware Synthesis of Human Motions	Proposes Shape My Moves to address text-driven, shape-aware motion generation.	text-to-motion motion synthesis motion generation
