cs.CV (2025-03-21)

📊 40 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (13 🔗3) · Pillar 2: RL & Architecture (8 🔗3) · Pillar 8: Physics-based Animation (8 🔗1) · Pillar 9: Embodied Foundation Models (6 🔗2) · Pillar 1: Robot Control (3 🔗2) · Pillar 4: Generative Motion (1 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1)

🔬 Pillar 3: Perception & Semantics (13 papers)

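| # | Title | Key point | Tags |
|---|---|---|---|
| 1 | DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery | DroneSplat: uses 3D Gaussian splatting for robust 3D reconstruction from in-the-wild drone imagery | 3D gaussian splatting, 3DGS, gaussian splatting |
| 2 | Optimized Minimal 3D Gaussian Splatting | Proposes OMG (Optimized Minimal 3D Gaussian Splatting), which significantly reduces storage requirements while maintaining high rendering quality | 3D gaussian splatting, 3DGS, gaussian splatting |
| 3 | Is there anything left? Measuring semantic residuals of objects removed from 3D Gaussian Splatting | Proposes a semantic-residual metric to evaluate how well privacy is protected after objects are removed from 3D Gaussian splatting scenes | 3D gaussian splatting, gaussian splatting, splatting |
| 4 | Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting | Proposes Instant Gaussian Stream to address the high latency of dynamic scene reconstruction | gaussian splatting, splatting, scene reconstruction |
| 5 | An Iterative Feedback Mechanism for Improving Natural Language Class Descriptions in Open-Vocabulary Object Detection | Proposes an iterative feedback mechanism that improves the quality of natural-language class descriptions in open-vocabulary object detection | open-vocabulary, open vocabulary |
| 6 | Superpowering Open-Vocabulary Object Detectors for X-ray Vision | RAXO: enables open-vocabulary object detection on X-ray imagery without training data | open-vocabulary, open vocabulary |
| 7 | ProtoGS: Efficient and High-Quality Rendering with 3D Gaussian Prototypes | ProtoGS: uses 3D Gaussian prototypes for efficient, high-quality rendering | 3D gaussian splatting, 3DGS, gaussian splatting |
| 8 | Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras | Proposes an unsupervised framework for jointly learning optical flow and image intensity with event cameras | optical flow |
| 9 | ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail | ExCap3D: expressive 3D scene understanding through object captions at multiple levels of detail | scene understanding |
| 10 | Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks | Proposes Video Interface Networks (VINs) for scalable parallel video generation, improving the efficiency and quality of long-video generation | optical flow |
| 11 | Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image | Estimates camera motion from a motion-blurred image, giving IMU-like capture of fast motion | monocular depth |
| 12 | AnimatePainter: A Self-Supervised Rendering Framework for Reconstructing Painting Process | AnimatePainter: proposes a self-supervised rendering framework for reconstructing the painting process | depth estimation |
| 13 | Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision | Seg2Box: proposes a 3D object detection method supervised only by semantic labels | scene understanding |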

🔬 Pillar 2: RL & Architecture (8 papers)

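| # | Title | Key point | Tags |
|---|---|---|---|
| 14 | Radar-Guided Polynomial Fitting for Metric Depth Estimation | POLAR: uses radar-guided polynomial fitting for accurate monocular depth estimation | MAE, depth estimation, monocular depth |
| 15 | TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment | TEMPLE: incentivizes temporal understanding in video large language models through progressive pre-SFT alignment | preference learning, DPO, direct preference optimization |
| 16 | Distilling Monocular Foundation Model for Fine-grained Depth Completion | Proposes a two-stage knowledge distillation framework that uses a monocular foundation model to improve fine-grained depth completion | distillation, depth estimation, monocular depth |
| 17 | VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models | Proposes VQToken, neural discrete token representation learning for extreme token reduction in video large language models | representation learning, large language model |
| 18 | OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles | OpenVLThinker: complex vision-language reasoning through iterative SFT-RL cycles | reinforcement learning, multimodal, visual grounding |
| 19 | ARFlow: Human Action-Reaction Flow Matching with Physical Guidance | ARFlow: a physically guided flow matching model for human action-reaction synthesis that addresses physical penetration in interaction synthesis | flow matching, penetration, reaction synthesis |
| 20 | MM-UNet: Meta Mamba UNet for Medical Image Segmentation | Proposes MM-UNet, which uses a Meta Mamba structure to optimize the application of SSMs in medical image segmentation | Mamba, SSM, state space model |
| 21 | Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification | Proposes classifier-guided CLIP distillation for unsupervised multi-label classification | distillation |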

🔬 Pillar 8: Physics-based Animation (8 papers)

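| # | Title | Key point | Tags |
|---|---|---|---|
| 22 | Spatiotemporal Learning with Context-aware Video Tubelets for Ultrasound Video Analysis | Proposes a spatiotemporal learning method based on context-aware video tubelets for ultrasound video analysis | spatiotemporal |
| 23 | Recovering Pulse Waves from Video Using Deep Unrolling and Deep Equilibrium Models | Proposes an iPPG pulse-wave recovery method that combines deep learning with signal processing | PULSE |
| 24 | UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models | UniCon: unidirectional information flow for controlling large-scale diffusion models, improving training efficiency and control precision | UniCon |
| 25 | Dynamic Attention Mechanism in Spatiotemporal Memory Networks for Object Tracking | Proposes a dynamic-attention spatiotemporal memory network (DASTM) to address feature selection and fusion for object tracking in complex scenes | spatiotemporal |
| 26 | Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography | Proposes TURNIP, a noise-robust iPPG pulse signal estimation method based on a time-series U-Net with recurrence | PULSE |
| 27 | Which2comm: An Efficient Collaborative Perception Framework for 3D Object Detection | Proposes Which2comm, which uses semantic detection boxes for efficient collaborative 3D object detection | spatiotemporal |
| 28 | Temporal-Guided Spiking Neural Networks for Event-Based Human Action Recognition | Proposes temporal-guided spiking neural networks for event-camera-based human action recognition | spatiotemporal |
| 29 | Enabling Versatile Controls for Video Diffusion Models | VCtrl: versatile control of video diffusion models through a unified control framework | spatiotemporal |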

🔬 Pillar 9: Embodied Foundation Models (6 papers)

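| # | Title | Key point | Tags |
|---|---|---|---|
| 30 | LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models | LoRASculpt: sculpts LoRA to harmonize general and specialized knowledge in multimodal large language models | large language model, multimodal |
| 31 | ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology | Proposes the ModalTune framework to address multi-task learning in digital pathology | foundation model |
| 32 | Meme Similarity and Emotion Detection using Multimodal Analysis | Proposes a meme similarity and emotion detection method based on the multimodal CLIP model to improve understanding of online content | multimodal |
| 33 | Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition | Proposes a dual visual feature extraction model that fuses ViT and ResNet features, improving multimodal emotion recognition in complex scenes | multimodal |
| 34 | Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models | Reveals shortcomings in the spatial awareness of vision-language models, proposes interpretability tools, and improves the multimodal attention mechanism | multimodal |
| 35 | PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction | PP-DocLayout: a unified document layout detection model that accelerates large-scale data construction | multimodal |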

🔬 Pillar 1: Robot Control (3 papers)

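| # | Title | Key point | Tags |
|---|---|---|---|
| 36 | TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting | TaoAvatar: real-time, lifelike, full-body interactive augmented-reality avatars based on 3D Gaussian splatting | Apple Vision Pro, distillation, 3D gaussian splatting |
| 37 | Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment | Proposes a trajectory prediction framework based on locomotion embodiment that improves the physical plausibility of predicted trajectories | locomotion, motion generation |
| 38 | Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval | PrediCIR: uses a world model to predict missing target information, improving the accuracy of zero-shot composed image retrieval | manipulation, world model |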

🔬 Pillar 4: Generative Motion (1 paper)

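| # | Title | Key point | Tags |
|---|---|---|---|
| 39 | PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning | PRIMAL: a physically interactive motor model for avatar learning that improves realism and responsiveness | motion generation, human motion, character animation |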

🔬 Pillar 5: Interaction & Reaction (1 paper)

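| # | Title | Key point | Tags |
|---|---|---|---|
| 40 | Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model | Proposes the Re-HOLD framework, which reenacts hand-object interactions in video using an adaptive layout-instructed diffusion model | human-object interaction, HOI |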
