cs.CV (2025-03-21)

📊 40 papers in total | 🔗 13 with code

🎯 Topic navigation

Pillar 3: Spatial Perception & Semantics (13 🔗3) · Pillar 2: RL Algorithms & Architecture (8 🔗3) · Pillar 8: Physics-based Animation (8 🔗1) · Pillar 9: Embodied Foundation Models (6 🔗2) · Pillar 1: Robot Control (3 🔗2) · Pillar 4: Generative Motion (1 🔗1) · Pillar 5: Interaction & Reaction (1 🔗1)

🔬 Pillar 3: Spatial Perception & Semantics (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery	DroneSplat: robust 3D reconstruction from in-the-wild drone imagery using 3D Gaussian Splatting	3D gaussian splatting 3DGS gaussian splatting
2	Optimized Minimal 3D Gaussian Splatting	Proposes OMG (Optimized Minimal 3D Gaussian Splatting), which significantly reduces storage requirements while maintaining high rendering quality	3D gaussian splatting 3DGS gaussian splatting	✅
3	Is there anything left? Measuring semantic residuals of objects removed from 3D Gaussian Splatting	Proposes a semantic-residual metric to evaluate how well privacy is protected after objects are removed from 3D Gaussian Splatting scenes	3D gaussian splatting gaussian splatting splatting
4	Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting	Proposes Instant Gaussian Stream to address high latency in dynamic scene reconstruction	gaussian splatting splatting scene reconstruction
5	An Iterative Feedback Mechanism for Improving Natural Language Class Descriptions in Open-Vocabulary Object Detection	Proposes an iterative feedback mechanism that improves the quality of natural-language class descriptions in open-vocabulary object detection	open-vocabulary open vocabulary
6	Superpowering Open-Vocabulary Object Detectors for X-ray Vision	RAXO: enables open-vocabulary object detection on X-ray imagery without requiring training data	open-vocabulary open vocabulary	✅
7	ProtoGS: Efficient and High-Quality Rendering with 3D Gaussian Prototypes	ProtoGS: efficient, high-quality rendering with 3D Gaussian prototypes	3D gaussian splatting 3DGS gaussian splatting
8	Unsupervised Joint Learning of Optical Flow and Intensity with Event Cameras	Proposes an unsupervised framework for jointly learning optical flow and image intensity with event cameras	optical flow	✅
9	ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail	ExCap3D: expressive 3D scene understanding via object captioning at multiple levels of detail	scene understanding
10	Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks	Proposes Video Interface Networks (VINs) for scalable parallel video generation, improving the efficiency and quality of long-video generation	optical flow
11	Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image	Estimates camera motion from a motion-blurred image, enabling IMU-like capture of fast motion	monocular depth
12	AnimatePainter: A Self-Supervised Rendering Framework for Reconstructing Painting Process	AnimatePainter: a self-supervised rendering framework for reconstructing the painting process	depth estimation
13	Seg2Box: 3D Object Detection by Point-Wise Semantics Supervision	Seg2Box: a 3D object detection method supervised only by semantic labels	scene understanding

🔬 Pillar 2: RL Algorithms & Architecture (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	Radar-Guided Polynomial Fitting for Metric Depth Estimation	POLAR: radar-guided polynomial fitting for accurate monocular depth estimation	MAE depth estimation monocular depth
15	TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment	TEMPLE: incentivizes temporal understanding in video large language models via progressive pre-SFT alignment	preference learning DPO direct preference optimization
16	Distilling Monocular Foundation Model for Fine-grained Depth Completion	Proposes a two-stage knowledge distillation framework that uses a monocular foundation model to improve fine-grained depth completion	distillation depth estimation monocular depth	✅
17	VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models	Proposes VQToken, a neural discrete token representation learning approach for extreme token reduction in video large language models	representation learning large language model
18	OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles	OpenVLThinker: complex vision-language reasoning via iterative SFT-RL cycles	reinforcement learning multimodal visual grounding	✅
19	ARFlow: Human Action-Reaction Flow Matching with Physical Guidance	ARFlow: a physically guided flow matching model for human action-reaction synthesis that addresses physical penetration in interaction synthesis	flow matching penetration reaction synthesis
20	MM-UNet: Meta Mamba UNet for Medical Image Segmentation	Proposes MM-UNet, which uses a Meta Mamba structure to optimize the use of SSMs in medical image segmentation	Mamba SSM state space model
21	Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification	Proposes classifier-guided CLIP distillation for unsupervised multi-label classification	distillation	✅

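🔬 Pillar 8: Physics-based Animation (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	Spatiotemporal Learning with Context-aware Video Tubelets for Ultrasound Video Analysis	Proposes a spatiotemporal learning method based on context-aware video tubelets for ultrasound video analysis	spatiotemporal
23	Recovering Pulse Waves from Video Using Deep Unrolling and Deep Equilibrium Models	Proposes an iPPG pulse-wave recovery method that combines deep learning with signal processing	PULSE
24	UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models	UniCon: unidirectional information flow for controlling large-scale diffusion models, improving training efficiency and control precision	UniCon
25	Dynamic Attention Mechanism in Spatiotemporal Memory Networks for Object Tracking	Proposes a dynamic-attention spatiotemporal memory network (DASTM) to address feature selection and fusion for object tracking in complex scenes	spatiotemporal
26	Time-Series U-Net with Recurrence for Noise-Robust Imaging Photoplethysmography	Proposes TURNIP, a noise-robust iPPG pulse signal estimation method based on a time-series U-Net with recurrence	PULSE
27	Which2comm: An Efficient Collaborative Perception Framework for 3D Object Detection	Proposes Which2comm, which uses semantic detection boxes for efficient collaborative 3D object detection	spatiotemporal
28	Temporal-Guided Spiking Neural Networks for Event-Based Human Action Recognition	Proposes temporal-guided spiking neural networks for human action recognition with event cameras	spatiotemporal
29	Enabling Versatile Controls for Video Diffusion Models	VCtrl: versatile control of video diffusion models through a unified control framework	spatiotemporal	✅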

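🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
30	LoRASculpt: Sculpting LoRA for Harmonizing General and Specialized Knowledge in Multimodal Large Language Models	LoRASculpt: sculpts LoRA to harmonize general and specialized knowledge in multimodal large language models	large language model multimodal
31	ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology	Proposes the ModalTune framework to address multi-task learning in digital pathology	foundation model
32	Meme Similarity and Emotion Detection using Multimodal Analysis	Proposes a meme similarity and emotion detection method based on the multimodal CLIP model, improving understanding of online content	multimodal
33	Feature-Based Dual Visual Feature Extraction Model for Compound Multimodal Emotion Recognition	Proposes a dual visual feature extraction model that fuses ViT and ResNet features, improving multimodal emotion recognition in complex scenes	multimodal	✅
34	Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models	Reveals insufficient spatial awareness in vision-language models, and proposes interpretability tools and an improved multimodal attention mechanism	multimodal
35	PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction	PP-DocLayout: a unified document layout detection model that accelerates large-scale data construction	multimodal	✅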

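🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
36	TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting	TaoAvatar: real-time, lifelike, full-body interactive augmented-reality avatars based on 3D Gaussian Splatting	Apple Vision Pro distillation 3D gaussian splatting
37	Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment	Proposes a trajectory prediction framework based on locomotion embodiment, improving the physical plausibility of predicted trajectories	locomotion motion generation	✅
38	Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval	PrediCIR: uses a world model to predict missing target-relevant information, improving the accuracy of zero-shot composed image retrieval	manipulation world model	✅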

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
39	PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning	PRIMAL: a physically reactive and interactive motor model for avatar learning, improving realism and responsiveness	motion generation human motion character animation	✅

🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
40	Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model	Proposes the Re-HOLD framework, which reenacts hand-object interactions in video via an adaptive layout-instructed diffusion model	human-object interaction HOI	✅
