cs.CV (2026-02-05)

📊 40 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception & Semantics (13 🔗2) | Pillar 2: RL Algorithms & Architecture (11 🔗2) | Pillar 9: Embodied Foundation Models (11 🔗2) | Pillar 1: Robot Control (2) | Pillar 6: Video Extraction and Matching (2) | Pillar 8: Physics-based Animation (1)

🔬 Pillar 3: Spatial Perception & Semantics (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Fast-SAM3D: 3Dfy Anything in Images but Faster	Fast-SAM3D: accelerates image-to-3D reconstruction, improving inference efficiency while preserving accuracy.	sam 3D, SAM 3D, spatiotemporal	✅
2	VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency	VGGT-Motion: a calibration-free monocular SLAM system for long-range consistency.	optical flow, VGGT, feature matching
3	NeVStereo: A NeRF-Driven NVS-Stereo Architecture for High-Fidelity 3D Tasks	NeVStereo: a NeRF-driven NVS-Stereo architecture for high-fidelity 3D tasks.	depth estimation, NeRF, VGGT
4	LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation	LoGoSeg: an open-vocabulary semantic segmentation framework that integrates local and global features.	open-vocabulary, open vocabulary
5	ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors	ShapeGaussian: high-fidelity 4D human reconstruction from monocular videos using vision priors.	scene reconstruction, SMPL, human motion
6	MTPano: Multi-Task Panoramic Scene Understanding via Label-Free Integration of Dense Prediction Priors	MTPano: multi-task panoramic scene understanding via label-free integration of dense prediction priors.	scene understanding, foundation model
7	Predicting Camera Pose from Perspective Descriptions for Spatial Reasoning	Proposes the CAMCUE framework, which uses camera pose for multi-view spatial reasoning and viewpoint prediction.	scene understanding, large language model, multimodal
8	NVS-HO: A Benchmark for Novel View Synthesis of Handheld Objects	NVS-HO: the first RGB benchmark dataset for novel view synthesis of handheld objects.	gaussian splatting, splatting, NeRF
9	MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation	Proposes the MerNav framework to address the difficulty of achieving both generalization and a high success rate in zero-shot object goal navigation.	open-vocabulary, open vocabulary, VLN
10	PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction	Proposes PoseGaussian, a pose-guided framework for high-fidelity novel view synthesis of humans.	depth estimation, gaussian splatting, splatting
11	IndustryShapes: An RGB-D Benchmark dataset for 6D object pose estimation of industrial assembly components and tools	IndustryShapes: an RGB-D benchmark dataset for 6D pose estimation of industrial assembly components and tools.	6D pose estimation	✅
12	Feature points evaluation on omnidirectional vision with a photorealistic fisheye sequence -- A report on experiments done in 2014	Feature point evaluation for omnidirectional vision: a report on experiments (2014) using a photorealistic fisheye sequence.	visual odometry
13	Dual-Representation Image Compression at Ultra-Low Bitrates via Explicit Semantics and Implicit Textures	Proposes a dual-representation image compression framework that combines explicit semantics and implicit textures to improve compression performance at ultra-low bitrates.	implicit representation

🔬 Pillar 2: RL Algorithms & Architecture (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos	UniSurg: a video-native foundation model for universal understanding of surgical videos.	distillation, depth estimation, motion prediction
15	MambaVF: State Space Model for Efficient Video Fusion	MambaVF: an efficient video fusion framework based on state space models that requires no optical flow estimation.	Mamba, SSM, state space model
16	V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval	V-Retrver: proposes an evidence-driven agentic reasoning framework for universal multimodal retrieval.	reinforcement learning, large language model, multimodal
17	RFM-Pose: Reinforcement-Guided Flow Matching for Fast Category-Level 6D Pose Estimation	RFM-Pose: reinforcement-learning-guided flow matching that accelerates category-level 6D pose estimation.	reinforcement learning, flow matching, 6D pose estimation
18	Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation	Proposes the Splat and Distill framework, which augments teacher models with feed-forward 3D reconstruction to improve the 3D awareness of 2D vision models.	distillation, depth estimation, monocular depth	✅
19	VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation	VisRefiner: improves screenshot-to-code generation by learning from visual differences.	reinforcement learning, large language model, multimodal
20	Weaver: End-to-End Agentic System Training for Video Interleaved Reasoning	Weaver: proposes an end-to-end agentic system training method for video interleaved reasoning.	reinforcement learning, multimodal, chain-of-thought
21	Dataset Distillation via Relative Distribution Matching and Cognitive Heritage	Proposes a dataset distillation method based on statistical flow matching and cognitive heritage that reduces compute and memory overhead.	flow matching, distillation
22	ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network	Proposes ReGLA, an efficient receptive-field modeling method based on a gated linear attention network, suited to high-resolution images.	linear attention, distillation
23	UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents	UI-Mem: proposes an online reinforcement learning framework with self-evolving experience memory for mobile GUI agents.	reinforcement learning
24	FMPose3D: monocular 3D pose estimation via flow matching	Proposes FMPose3D, which uses flow matching to efficiently address depth ambiguity in monocular 3D pose estimation.	flow matching	✅

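🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-line summary	Tags	🔗	⭐
25	Multimodal Latent Reasoning via Hierarchical Visual Cues Injection	Proposes the HIVE framework, which performs multimodal latent-space reasoning through hierarchical visual cue injection.	large language model, multimodal
26	Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs	Proposes Magic-MM-Embedding, which improves the efficiency and performance of MLLMs for universal multimodal embedding through visual token compression and multi-stage training.	large language model, multimodal
27	SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs	SwimBird: proposes a hybrid autoregressive MLLM with switchable reasoning modes to improve performance on vision-dense tasks.	large language model, multimodal
28	Thinking with Geometry: Active Geometry Integration for Spatial Reasoning	GeoThinker: enhances spatial reasoning in multimodal large language models through active geometry integration.	large language model, multimodal	✅
29	SOMA-1M: A Large-Scale SAR-Optical Multi-resolution Alignment Dataset for Multi-Task Remote Sensing	Proposes SOMA-1M, a large-scale multi-resolution SAR-optical image alignment dataset, to support research on multimodal remote sensing tasks.	foundation model, multimodal	✅
30	E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching	E.M.Ground: a temporal grounding Vid-LLM with holistic event perception and matching.	large language model, TAMP
31	Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation	SparseVideoNav: uses sparse video generation for beyond-the-view vision-language navigation in real-world scenes.	large language model
32	RISE-Video: Can Video Generators Decode Implicit World Rules?	Proposes the RISE-Video benchmark, which evaluates how well text-to-video generation models understand implicit world rules.	multimodal
33	Adaptive Global and Fine-Grained Perceptual Fusion for MLLM Embeddings Compatible with Hard Negative Amplification	Proposes AGFF-Embed, which fuses global and fine-grained perception to improve MLLM embedding performance.	multimodal
34	VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs	VRIQ: proposes a visual-reasoning IQ benchmark and analyzes the limitations of VLMs in non-verbal reasoning.	multimodal
35	Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning	Wid3R: wide field-of-view 3D reconstruction via camera model conditioning.	foundation model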

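🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
36	InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions	InterPrior: scales generative control for physics-based interactions through imitation learning and reinforcement learning.	humanoid, manipulation, loco-manipulation
37	ShapeUP: Scalable Image-Conditioned 3D Editing	ShapeUP: a scalable image-conditioned 3D editing framework for fine-grained, controllable 3D content creation.	manipulation, geometric consistency, foundation model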

🔬 Pillar 6: Video Extraction and Matching (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
38	EgoPoseVR: Spatiotemporal Multi-Modal Reasoning for Egocentric Full-Body Pose in Virtual Reality	Proposes EgoPoseVR to address full-body pose estimation in virtual reality.	egocentric, spatiotemporal
39	Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation	Allocentric Perceiver: disentangles allocentric (scene-centered) reasoning from egocentric visual priors via frame instantiation.	egocentric

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
40	GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling	Proposes a generative-Transformer-based self-supervised video judge for efficient video reward modeling.	spatiotemporal
