cs.CV(2025-02-17)

📊 18 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

- Pillar 9: Embodied Foundation Models (8 🔗3)
- Pillar 3: Perception & Semantics (5 🔗1)
- Pillar 2: RL & Architecture (3 🔗1)
- Pillar 4: Generative Motion (1 🔗1)
- Pillar 1: Robot Control (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

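| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | NOTA: Multimodal Music Notation Understanding for Visual Large Language Model | Proposes the NOTA dataset and the NotaGPT model to improve visual large language models' understanding of music notation. | large language model, multimodal | |
| 2 | PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection | PRISM: a training-free self-pruning method for multimodal data selection that addresses anisotropy in visual feature distributions. | large language model, multimodal | |
| 3 | Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning | Proposes a framework based on Modular Visual Contrastive Decoding (MVCD) to improve LLMs' visual perception in multimodal reasoning. | large language model, multimodal | |
| 4 | Token Communications: A Large Model-Driven Framework for Cross-modal Context-aware Semantic Communications | Proposes the Token Communications framework, which uses large models to drive cross-modal, context-aware semantic communications. | large language model, foundation model, multimodal | |
| 5 | Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics | Builds a psychometric framework to evaluate the basic spatial abilities of visual language models. | embodied AI, chain-of-thought | |
| 6 | Intuitive physics understanding emerges from self-supervised pretraining on natural videos | Shows that an understanding of intuitive physics emerges in models through self-supervised pretraining on natural videos. | large language model, multimodal | |
| 7 | Detecting Systematic Weaknesses in Vision Models along Predefined Human-Understandable Dimensions | Proposes an algorithm combining foundation models with combinatorial search to detect systematic weaknesses in vision models along predefined dimensions. | foundation model | |
| 8 | Duo Streamers: A Streaming Gesture Recognition Framework | Duo Streamers: a streaming gesture recognition framework for resource-constrained scenarios. | multimodal | |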

🔬 Pillar 3: Perception & Semantics (5 papers)

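| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 9 | PUGS: Zero-shot Physical Understanding with Gaussian Splatting | PUGS: a zero-shot method for understanding physical properties based on Gaussian splatting. | gaussian splatting, splatting | |
| 10 | From Open-Vocabulary to Vocabulary-Free Semantic Segmentation | Proposes vocabulary-free semantic segmentation, which recognizes objects in a scene without predefined categories. | open-vocabulary, open vocabulary | |
| 11 | 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency | Proposes 3DGIC, which uses depth-guided cross-view consistency for 3D Gaussian inpainting. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 12 | Deep Neural Networks for Accurate Depth Estimation with Latent Space Features | Proposes a deep neural network based on latent space features that improves monocular depth estimation accuracy, especially in indoor scenes. | depth estimation, monocular depth, scene reconstruction | |
| 13 | HumanGif: Single-View Human Diffusion with Generative Prior | HumanGif: a single-view human diffusion model that uses generative priors to produce realistic human animation. | NeRF, character animation | |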

🔬 Pillar 2: RL & Architecture (3 papers)

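| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation | HermesFlow: a general framework that closes the gap between multimodal understanding and generation capabilities. | DPO, large language model, foundation model | |
| 15 | High-Dynamic Radar Sequence Prediction for Weather Nowcasting Using Spatiotemporal Coherent Gaussian Representation | Proposes a spatiotemporally coherent Gaussian representation and GauMamba for high-dynamic weather radar sequence prediction. | Mamba, gaussian splatting, splatting | |
| 16 | video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | Proposes video-SALMONN-o1, the first reasoning-enhanced audio-visual large language model for general video understanding. | direct preference optimization, large language model, multimodal | |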

🔬 Pillar 4: Generative Motion (1 paper)

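| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | Diffusion Models without Classifier-free Guidance | Proposes Model-guidance for training diffusion models without classifier-free guidance, improving training and inference efficiency. | classifier-free guidance | |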

🔬 Pillar 1: Robot Control (1 paper)

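| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 18 | Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening | Diffusion-Sharpening: fine-tunes diffusion models by sharpening denoising trajectories to improve downstream task alignment. | trajectory optimization, DPO | |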
