cs.CV (2025-04-03)

📊 26 papers total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (9 🔗1) · Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 3: Spatial Perception & Semantics (6 🔗1) · Pillar 1: Robot Control (2) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction & Matching (1)

🔬 Pillar 2: RL Algorithms & Architecture (9 papers)

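| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning | VARGPT-v1.1: improves a visual autoregressive large unified model through iterative instruction tuning and reinforcement learning. | reinforcement learning, DPO, direct preference optimization | |
| 2 | ConsDreamer: Advancing Multi-View Consistency for Zero-Shot Text-to-3D Generation | ConsDreamer improves multi-view consistency in zero-shot text-to-3D generation by decoupling view bias and geometric consistency. | dreamer, distillation, 3D gaussian splatting | |
| 3 | Refining CLIP's Spatial Awareness: A Visual-Centric Perspective | Proposes a spatial correlation distillation framework to improve CLIP's spatial awareness in dense prediction tasks. | distillation, open-vocabulary, open vocabulary | |
| 4 | Agglomerating Large Vision Encoders via Distillation for VFSS Segmentation | Proposes a knowledge-distillation-based method for agglomerating vision encoders to improve medical image segmentation performance. | distillation, foundation model | |
| 5 | Morpheus: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments | Morpheus: evaluates physical reasoning in video generative models through real physical experiments. | world model, physically plausible, foundation model | |
| 6 | All-day Depth Completion via Thermal-LiDAR Fusion | Proposes COPS, a framework based on contrastive learning and pseudo-supervision, for all-day thermal-LiDAR depth completion. | contrastive learning, monocular depth, foundation model | |
| 7 | Learning Phase Distortion with Selective State Space Models for Video Turbulence Mitigation | Proposes a video turbulence mitigation method based on selective state space models to improve long-range imaging quality. | Mamba, state space model | |
| 8 | SelfMedHPM: Self Pre-training With Hard Patches Mining Masked Autoencoders For Medical Image Segmentation | SelfMedHPM: self-supervised pre-training for medical image segmentation using hard-patch-mining masked autoencoders. | masked autoencoder, MAE | |
| 9 | Learning Audio-guided Video Representation with Gated Attention for Video-Text Retrieval | Proposes the AVIGATE model, which uses a gated attention mechanism and an adaptive contrastive loss to improve audio-visual text retrieval performance. | representation learning, multimodal | |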

🔬 Pillar 9: Embodied Foundation Models (7 papers)

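| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 10 | OmniCam: Unified Multimodal Video Generation via Camera Control | OmniCam: unified multimodal video generation through camera control. | large language model, multimodal | |
| 11 | STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection | Proposes STING-BEE, a vision-language model for real-world X-ray baggage security inspection. | multimodal, instruction following, visual grounding | |
| 12 | OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking | OmniTalker: real-time text-driven talking audio-video generation with multimodal style mimicking. | multimodal | |
| 13 | MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception | MMTL-UniAD: a unified multimodal, multi-task framework for assistive driving perception. | multimodal | |
| 14 | Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models | Proposes a sparse-autoencoder-based method for learning monosemantic features in vision-language models, improving interpretability and controllability. | large language model, multimodal | |
| 15 | LLM-Guided Evolution: An Autonomous Model Optimization for Object Detection | Proposes an LLM-guided evolution method to optimize object detection models. | large language model | |
| 16 | Multifaceted Evaluation of Audio-Visual Capability for MLLMs: Effectiveness, Efficiency, Generalizability and Robustness | A multifaceted evaluation framework for the audio-visual capabilities of multimodal large language models, covering effectiveness, efficiency, generalizability, and robustness. | large language model | |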

🔬 Pillar 3: Spatial Perception & Semantics (6 papers)

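| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | Compressing 3D Gaussian Splatting by Noise-Substituted Vector Quantization | Proposes noise-substituted vector quantization to compress 3D Gaussian Splatting models and accelerate rendering. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 18 | F-ViTA: Foundation Model Guided Visible to Thermal Translation | Proposes F-ViTA, which uses foundation models to guide visible-to-thermal translation, improving low-light scene understanding. | scene understanding, foundation model | |
| 19 | MonoGS++: Fast and Accurate Monocular RGB Gaussian SLAM | Proposes MonoGS++ to address the hardware dependency of monocular RGB SLAM. | visual odometry, 3D gaussian splatting, gaussian splatting | |
| 20 | MultiNeRF: Multiple Watermark Embedding for Neural Radiance Fields | MultiNeRF: a method for embedding multiple watermarks in neural radiance fields, enabling attribution of 3D content. | NeRF, neural radiance field | |
| 21 | L-LBVC: Long-Term Motion Estimation and Prediction for Learned Bi-Directional Video Compression | L-LBVC: a learned bi-directional video compression framework with long-term motion estimation and prediction. | optical flow | |
| 22 | LPA3D: 3D Room-Level Scene Generation from In-the-Wild Images | Proposes LPA-GAN, which generates realistic, semantically plausible 3D indoor scenes from a single image. | NeRF | |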

🔬 Pillar 1: Robot Control (2 papers)

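| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 23 | DiSRT-In-Bed: Diffusion-Based Sim-to-Real Transfer Framework for In-Bed Human Mesh Recovery | Proposes DiSRT-In-Bed, a diffusion-based sim-to-real framework for in-bed human mesh recovery. | sim-to-real, human mesh recovery | |
| 24 | Concept Lancet: Image Editing with Compositional Representation Transplant | Concept Lancet: proposes an image editing method based on compositional representation transplant. | manipulation | |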

🔬 Pillar 4: Generative Motion (1 paper)

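| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 25 | MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities | Proposes MG-MotionLLM for multi-granularity motion comprehension and generation, addressing the limitations of existing methods on fine-grained motion tasks. | text-to-motion, large language model | |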

🔬 Pillar 6: Video Extraction & Matching (1 paper)

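| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | Towards Generalizing Temporal Action Segmentation to Unseen Views | Proposes a temporal action segmentation method that improves generalization to unseen views. | egocentric | |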
