cs.CV (2026-02-05)

📊 40 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (13 🔗2) | Pillar 2: RL & Architecture (11 🔗2) | Pillar 9: Embodied Foundation Models (11 🔗2) | Pillar 1: Robot Control (2) | Pillar 6: Video Extraction (2) | Pillar 8: Physics-based Animation (1)

🔬 Pillar 3: Perception & Semantics (13 papers)

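| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 1 | Fast-SAM3D: 3Dfy Anything in Images but Faster | Fast-SAM3D: accelerates 3D reconstruction from images, improving inference efficiency while preserving accuracy. | sam 3D, SAM 3D, spatiotemporal |
| 2 | VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency | VGGT-Motion: a calibration-free monocular SLAM system for long-range consistency. | optical flow, VGGT, feature matching |
| 3 | NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks | NeVStereo: a NeRF-driven NVS-Stereo architecture for high-fidelity 3D tasks. | depth estimation, NeRF, VGGT |
| 4 | LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation | LoGoSeg: an open-vocabulary semantic segmentation framework that integrates local and global features. | open-vocabulary, open vocabulary |
| 5 | ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors | ShapeGaussian: high-fidelity 4D human reconstruction from monocular video using vision priors. | scene reconstruction, SMPL, human motion |
| 6 | MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors | MTPano: multi-task panoramic scene understanding via label-free integration of dense prediction priors. | scene understanding, foundation model |
| 7 | Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning | Proposes the CAMCUE framework, which uses camera pose for multi-view spatial reasoning and viewpoint prediction. | scene understanding, large language model, multimodal |
| 8 | NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects | NVS-HO: the first RGB benchmark dataset for novel view synthesis of handheld objects. | gaussian splatting, splatting, NeRF |
| 9 | MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation | Proposes the MerNav framework to address the difficulty of achieving both strong generalization and a high success rate in zero-shot object goal navigation. | open-vocabulary, open vocabulary, VLN |
| 10 | PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction | Proposes PoseGaussian, a pose-guided framework for high-fidelity novel view synthesis of humans. | depth estimation, gaussian splatting, splatting |
| 11 | IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools | IndustryShapes: an RGB-D benchmark dataset for 6D pose estimation of industrial assembly components and tools. | 6D pose estimation |
| 12 | Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence -- A report on experiments done in 2014 | Feature point evaluation for omnidirectional vision: a report on experiments with a photorealistic fisheye sequence (2014). | visual odometry |
| 13 | Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures | Proposes a dual-representation image compression framework that combines explicit semantics and implicit textures to improve compression performance at ultra-low bitrates. | implicit representation |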

🔬 Pillar 2: RL & Architecture (11 papers)

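| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 14 | UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos | UniSurg: a video-native foundation model for universal understanding of surgical videos. | distillation, depth estimation, motion prediction |
| 15 | MambaVF: State Space Model for Efficient Video Fusion | MambaVF: an efficient video fusion framework based on state space models that requires no optical flow estimation. | Mamba, SSM, state space model |
| 16 | V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval | V-Retrver: proposes an evidence-driven agentic reasoning framework for universal multimodal retrieval. | reinforcement learning, large language model, multimodal |
| 17 | RFM-Pose: Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation | RFM-Pose: reinforcement-learning-guided flow matching that accelerates category-level 6D pose estimation. | reinforcement learning, flow matching, 6D pose estimation |
| 18 | Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation | Proposes the Splat and Distill framework, which augments teacher models with feed-forward 3D reconstruction to improve the 3D awareness of 2D vision models. | distillation, depth estimation, monocular depth |
| 19 | VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation | VisRefiner: improves screenshot-to-code generation by learning from visual differences. | reinforcement learning, large language model, multimodal |
| 20 | Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning | Weaver: proposes an end-to-end agentic system training method for video interleaved reasoning. | reinforcement learning, multimodal, chain-of-thought |
| 21 | Dataset Distillation via Relative Distribution Matching and Cognitive Heritage | Proposes a dataset distillation method based on statistical flow matching and cognitive heritage, reducing compute and memory overhead. | flow matching, distillation |
| 22 | ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network | Proposes ReGLA, an efficient receptive-field modeling method based on a gated linear attention network, suited to high-resolution images. | linear attention, distillation |
| 23 | UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents | UI-Mem: proposes an online reinforcement learning framework with self-evolving experience memory for mobile GUI agents. | reinforcement learning |
| 24 | FMPose3D: monocular 3D pose estimation via flow matching | Proposes FMPose3D, which uses flow matching to efficiently address depth ambiguity in monocular 3D pose estimation. | flow matching |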

🔬 Pillar 9: Embodied Foundation Models (11 papers)

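| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 25 | Multimodal Latent Reasoning via Hierarchical Visual Cues Injection | Proposes the HIVE framework, which performs multimodal latent-space reasoning via hierarchical visual cue injection. | large language model, multimodal |
| 26 | Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs | Proposes Magic-MM-Embedding, which improves the efficiency and performance of MLLMs for universal multimodal embedding through visual token compression and multi-stage training. | large language model, multimodal |
| 27 | SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs | SwimBird: proposes a hybrid autoregressive MLLM with switchable reasoning modes to improve performance on vision-intensive tasks. | large language model, multimodal |
| 28 | Thinking with Geometry: Active Geometry Integration for Spatial Reasoning | GeoThinker: enhances spatial reasoning in multimodal large language models through active geometry integration. | large language model, multimodal |
| 29 | SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing | Proposes SOMA-1M, a large-scale multi-resolution SAR-optical image alignment dataset, to advance research on multimodal remote sensing tasks. | foundation model, multimodal |
| 30 | E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching | E.M.Ground: a temporal grounding Vid-LLM with holistic event perception and matching. | large language model, TAMP |
| 31 | Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation | SparseVideoNav: uses sparse video generation for beyond-the-view vision-language navigation in real-world scenes. | large language model |
| 32 | RISE-Video: Can Video Generators Decode Implicit World Rules? | Proposes the RISE-Video benchmark, which evaluates how well text-to-video generation models understand implicit world rules. | multimodal |
| 33 | Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification | Proposes AGFF-Embed, which fuses global and fine-grained perception to improve MLLM embedding performance. | multimodal |
| 34 | VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs | VRIQ: proposes a visual-reasoning IQ benchmark and analyzes the limitations of VLMs in non-verbal reasoning. | multimodal |
| 35 | Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning | Wid3R: wide field-of-view 3D reconstruction via camera model conditioning. | foundation model |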

🔬 Pillar 1: Robot Control (2 papers)

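| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 36 | InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions | InterPrior: scales generative control for physics-based interactions through imitation learning and reinforcement learning. | humanoid, manipulation, loco-manipulation |
| 37 | ShapeUP: Scalable Image-Conditioned 3D Editing | ShapeUP: a scalable image-conditioned 3D editing framework for fine-grained, controllable 3D content creation. | manipulation, geometric consistency, foundation model |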

🔬 Pillar 6: Video Extraction (2 papers)

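| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 38 | EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality | Proposes EgoPoseVR to address full-body pose estimation in virtual reality. | egocentric, spatiotemporal |
| 39 | Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation | Allocentric Perceiver: disentangles allocentric (scene-centered) reasoning from egocentric visual priors via frame instantiation. | egocentric |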

🔬 Pillar 8: Physics-based Animation (1 paper)

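| # | Title | Key point | Tags |
|---|-------|-----------|------|
| 40 | GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling | Proposes a generative-Transformer-based self-supervised video judge model for efficient video reward modeling. | spatiotemporal |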
