cs.CV (2026-04-21)

📊 36 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception & Semantics (14 🔗3) · Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 2: RL Algorithms & Architecture (6) · Pillar 4: Generative Motion (3) · Pillar 6: Video Extraction & Matching (1 🔗1) · Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 3: Spatial Perception & Semantics (14 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement	InHabit: leverages image foundation models for scalable 3D human placement	scene reconstruction physically plausible human-scene interaction
2	An Object-Centered Data Acquisition Method for 3D Gaussian Splatting using Mobile Phones	Proposes a mobile-phone-based, object-centered data acquisition method for 3D Gaussian Splatting	3D gaussian splatting 3DGS gaussian splatting
3	GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction	Proposes GRAFT, which achieves high-quality human-scene reconstruction via a geometric refinement and fitting Transformer	scene reconstruction physically plausible penetration	✅
4	BALTIC: A Benchmark and Cross-Domain Strategy for 3D Reconstruction Across Air and Underwater Domains Under Varying Illumination	BALTIC: a benchmark and strategy for 3D reconstruction across air and underwater domains under varying illumination	3D gaussian splatting 3D reconstruction gaussian splatting
5	AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs	AdaGScale: viewpoint-adaptive Gaussian scaling that reduces the number of Gaussian-tile pairs in 3D Gaussian Splatting	3D gaussian splatting gaussian splatting splatting
6	Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval	Diff-SBSR: learns multimodal feature-enhanced diffusion models for zero-shot sketch-based 3D shape retrieval	open-vocabulary open vocabulary multimodal
7	CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation	CoCo-SAM3: harnesses concept conflict to address open-vocabulary semantic segmentation	open-vocabulary open vocabulary
8	TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing	TransSplat: language-driven 3DGS editing via unbalanced semantic transport	3D gaussian splatting 3DGS gaussian splatting
9	Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge	Evaluation of the winning solutions of the LPCVC 2025 challenge: advancing low-power computer vision	depth estimation monocular depth open-vocabulary
10	RAFT-MSF++: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow	RAFT-MSF++: self-supervised monocular scene flow estimation with temporal geometry-motion feature fusion	scene flow	✅
11	Paparazzo: Active Mapping of Moving 3D Objects	Paparazzo: active mapping of moving 3D objects for accurate dynamic scene reconstruction	3D reconstruction scene understanding	✅
12	Face Anything: 4D Face Reconstruction from Any Image Sequence	Proposes a 4D face reconstruction method based on canonical facial point prediction, resolving geometry and correspondence ambiguities in dynamic face reconstruction	depth estimation
13	TESO: Online Tracking of Essential Matrix by Stochastic Optimization	TESO: online tracking of the essential matrix by stochastic optimization, for long-term stereo camera calibration	stereo depth
14	Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents	ABot-Explorer: human-like autonomous exploration via online SG-Memo construction	affordance

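🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
15	Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks	Proposes the StepSTEM benchmark for fine-grained evaluation of multimodal LLM reasoning chains on STEM tasks	large language model multimodal chain-of-thought	✅
16	Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram	Uses multimodal LLMs to analyze political messaging on Instagram, improving visual political communication analysis	large language model multimodal
17	Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing	Proposes an efficient face anti-spoofing baseline based on self-supervised Vision Transformers	foundation model multimodal	✅
18	DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents	DR-MMSearchAgent: addresses interaction collapse in multimodal search agents by deepening reasoning	multimodal
19	How Far Are Video Models from True Multimodal Reasoning?	Proposes the CLVG-Bench evaluation framework, revealing the limitations of video models in multimodal reasoning	multimodal
20	A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation	Proposes a multi-agent framework that improves multimodal empathetic response generation through structured reasoning and reflective refinement	multimodal
21	Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images	Proposes a Cellpose-SAM-based automated grain size estimation method that bridges foundation models and ASTM standards	foundation model	✅
22	Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval	Proposes Air-Know to address the noisy triplet correspondence problem in Composed Image Retrieval	large language model multimodal
23	Deep sprite-based image models: An analysis	Proposes a deep sprite-based image decomposition model that addresses the recognition of repeated patterns in images, enabling interpretable unsupervised segmentation	foundation model
24	DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval	Proposes the DINO Eats CLIP framework, which improves open-set 3D object retrieval through dynamic multi-view fusion and virtual feature synthesis	foundation model
25	The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation	Proposes a stability-diversity balancing mechanism that improves the performance of self-improving agents in vision-and-language navigation	VLN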

🔬 Pillar 2: RL Algorithms & Architecture (6 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
26	SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model	SpanVLA: improves vision-language-action models through learning from negative-recovery samples and efficient action bridging	flow matching vision-language-action VLA
27	AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model	AnyRecon: arbitrary-view 3D reconstruction using a video diffusion model	distillation 3D reconstruction geometric consistency
28	PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving	PanDA: an unsupervised domain adaptation framework for multimodal 3D panoptic segmentation in autonomous driving	representation learning multimodal
29	Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding	Proposes Volume Transformer (Volt) to improve the generality and scalability of 3D scene understanding	distillation scene understanding
30	HP-Edit: A Human-Preference Post-Training Framework for Image Editing	HP-Edit: a human-preference post-training framework for image editing that improves generation quality	reinforcement learning RLHF DPO
31	PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment	PortraitDirector: a hierarchical disentanglement framework for controllable, real-time facial reenactment	distillation motion latent

🔬 Pillar 4: Generative Motion (3 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
32	EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation	Proposes the EgoMotion framework for egocentric human motion generation conditioned on vision and language	motion synthesis motion generation physically plausible
33	CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation	CoInteract: physically consistent human-object interaction video synthesis via spatially structured co-generation	physically plausible penetration human-object interaction
34	A Network-Aware Evaluation of Distributed Energy Resource Control in Smart Distribution Systems	A network-aware evaluation framework for distributed energy resource control in smart distribution systems	penetration

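🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
35	EgoSelf: From Memory to Personalized Egocentric Assistant	EgoSelf: builds a personalized egocentric assistant that uses graph memory for long-term user behavior modeling	egocentric first-person view	✅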

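🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
36	Generative Texture Filtering	Proposes a generative texture filtering method that uses pretrained generative models to improve texture removal	structure preservation	✅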
