cs.CV (2026-04-23)

📊 32 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception & Semantics (7 🔗3) · Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 2: RL Algorithms & Architecture (6 🔗1) · Pillar 1: Robot Control (4 🔗1) · Pillar 8: Physics-based Animation (4) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 6: Video Extraction & Matching (2)

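🔬 Pillar 3: Spatial Perception & Semantics (7 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures	DualSplat: bootstraps pseudo-masks from reconstruction failures to achieve robust 3D Gaussian splatting.	3D gaussian splatting 3DGS gaussian splatting
2	You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes	YOGO: controllable 3D Gaussian splatting for ultra-densely sampled scenes, bridging the gap between industry and academia.	3D gaussian splatting 3DGS gaussian splatting	✅
3	WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images	WildSplatter: feed-forward 3D Gaussian splatting from unconstrained images with appearance control.	3D gaussian splatting 3DGS gaussian splatting
4	Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs	Proposes the IMU-to-4D framework, which reconstructs humans and scenes in 4D from wearable IMUs and removes the dependence on vision.	scene understanding human motion spatiotemporal
5	Vista4D: Video Reshooting with 4D Point Clouds	Vista4D: proposes a video reshooting framework based on 4D point clouds that improves viewpoint control and visual quality for dynamic videos.	depth estimation	✅
6	Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation	CARVE: improves 3D visual geometry estimation through critical-factor analysis and high-resolution enhancement.	depth estimation
7	You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes	YOGO: controllable 3D Gaussian splatting for ultra-dense scenes, addressing difficulties in industrial application.	3D gaussian splatting 3DGS gaussian splatting	✅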

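🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
8	VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought	Proposes the VG-CoT dataset, which improves the trustworthiness of visual reasoning in LVLMs by grounding it in visual evidence.	visual grounding chain-of-thought
9	Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma wsi using foundation models	Proposes an attention-based multiple instance learning framework to predict growth patterns in lung adenocarcinoma.	foundation model
10	MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment	MiMIC: mitigates visual modality collapse in universal multimodal retrieval while avoiding semantic misalignment.	multimodal
11	Context Unrolling in Omni Models	Omni: unified multimodal modeling and reasoning via context unrolling.	multimodal
12	TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval	Proposes the TEMA framework, which addresses insufficient entity coverage and misalignment in multi-modification composed image retrieval.	multimodal	✅
13	From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media	Evaluates vision-language models for analyzing climate change discourse on social media.	chain-of-thought	✅
14	TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval	Proposes the TEMA framework, which addresses insufficient entity coverage and misalignment in multi-modification composed image retrieval.	multimodal	✅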

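🔬 Pillar 2: RL Algorithms & Architecture (6 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
15	S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images	S1-VL: a multimodal model that combines scientific reasoning with image interaction to improve problem solving in scientific domains.	reinforcement learning multimodal chain-of-thought
16	Latent Denoising Improves Visual Alignment in Large Multimodal Models	Proposes a visual alignment method based on latent-space denoising that improves the performance of large multimodal models.	distillation multimodal	✅
17	WorldMark: A Unified Benchmark Suite for Interactive Video World Models	WorldMark: a unified benchmark for interactive video world models that enables fair model comparison.	world model world models
18	Seeing Fast and Slow: Learning the Flow of Time in Videos	Proposes a temporal-flow learning framework that enables time-aware speed estimation, speed control, and super-resolution reconstruction for videos.	world model world models multimodal
19	VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection	Proposes VFM$^{4}$SDG, which uses vision foundation models to improve the cross-domain stability of single-domain generalized object detection.	representation learning distillation foundation model
20	UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection	Proposes UAU-Net, which uses uncertainty modeling to improve the robustness and reliability of facial action unit detection.	representation learning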

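🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
21	Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision	Proposes the EgoPoint-Bench benchmark to improve MLLMs' pointing-based referential understanding in egocentric vision.	sim-to-real egocentric egocentric vision	✅
22	LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation	LatRef-Diff: a latent- and reference-guided diffusion model for facial attribute editing and style transfer.	manipulation
23	Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation	Proposes a diffusion-based framework to explore the role of synthetic data in controllable human-centric video generation.	sim2real embodied AI
24	Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts	Proposes the SFAM framework, built on semantic fine-grained alignment and a mixture of experts, to improve the cross-domain generalization of face forgery detection.	manipulation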

🔬 Pillar 8: Physics-based Animation (4 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
25	Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting	Proposes Reshoot-Anything, which addresses the scarcity of multi-view data for in-the-wild video reshooting.	spatiotemporal
26	Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers	Sculpt4D: generates high-quality dynamic 4D shapes with sparse-attention diffusion Transformers.	spatiotemporal
27	Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation	Proposes Sparse Forcing, which accelerates autoregressive diffusion video generation and improves long-horizon generation quality.	spatiotemporal
28	Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting	Proposes Reshoot-Anything, a self-supervised model for video reshooting in real-world scenes.	spatiotemporal

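🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
29	Encoder-Free Human Motion Understanding via Structured Motion Descriptions	Proposes Structured Motion Descriptions (SMD), which enable human motion understanding without an encoder.	human motion large language model	✅
30	SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning	Proposes the SpatiO framework, which tackles spatial reasoning by orchestrating vision-language agents at test time.	spatial relationship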

🔬 Pillar 6: Video Extraction & Matching (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
31	OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction	OmniFit: multi-modal 3D body fitting via scale-agnostic dense landmark prediction.	SMPL SMPL-X
32	EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms	EgoMAGIC: an egocentric medical video dataset for training perception algorithms.	egocentric
