cs.CV (2026-04-23)

📊 32 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception & Semantics (7 🔗3) · Pillar 9: Embodied Foundation Models (7 🔗3) · Pillar 2: RL Algorithms & Architecture (6 🔗1) · Pillar 1: Robot Control (4 🔗1) · Pillar 8: Physics-based Animation (4) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 6: Video Extraction & Matching (2)

🔬 Pillar 3: Spatial Perception & Semantics (7 papers)

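| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 1 | DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures | DualSplat: uses pseudo-masks bootstrapped from reconstruction failures to achieve robust 3D Gaussian Splatting | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 2 | You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes | YOGO: controllable 3D Gaussian Splatting for ultra-densely sampled scenes, bridging the gap between industry and academia | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 3 | WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images | WildSplatter: feed-forward 3D Gaussian Splatting from unconstrained images with appearance control | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 4 | Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs | Proposes the IMU-to-4D framework, which uses wearable IMUs for 4D human-scene reconstruction, removing the dependence on vision | scene understanding, human motion, spatiotemporal | |
| 5 | Vista4D: Video Reshooting with 4D Point Clouds | Vista4D: proposes a video reshooting framework based on 4D point clouds, improving viewpoint control and visual quality for dynamic videos | depth estimation | |
| 6 | Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation | CARVE: improves 3D visual geometry estimation through critical-factor analysis and high-resolution enhancement | depth estimation | |
| 7 | You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes | YOGO: controllable 3D Gaussian Splatting for ultra-densely sampled scenes, addressing challenges in industrial applications | 3D gaussian splatting, 3DGS, gaussian splatting | |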

🔬 Pillar 9: Embodied Foundation Models (7 papers)

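| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 8 | VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought | Proposes the VG-CoT dataset, which improves the trustworthiness of LVLM visual reasoning by grounding it in visual evidence | visual grounding, chain-of-thought | |
| 9 | Attention-based multiple instance learning for predominant growth pattern prediction in lung adenocarcinoma wsi using foundation models | Proposes an attention-based multiple instance learning framework to predict growth patterns in lung adenocarcinoma | foundation model | |
| 10 | MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment | MiMIC: mitigates visual modality collapse in universal multimodal retrieval while avoiding semantic misalignment | multimodal | |
| 11 | Context Unrolling in Omni Models | Omni: unified multimodal modeling and reasoning through context unrolling | multimodal | |
| 12 | TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval | Proposes the TEMA framework to address insufficient entity coverage and misalignment in multi-modification composed image retrieval | multimodal | |
| 13 | From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media | Evaluates vision-language models for analyzing climate change discourse on social media | chain-of-thought | |
| 14 | TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval | Proposes the TEMA framework to address insufficient entity coverage and misalignment in multi-modification composed image retrieval | multimodal | |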

🔬 Pillar 2: RL Algorithms & Architecture (6 papers)

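| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 15 | S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images | S1-VL: a multimodal model that combines scientific reasoning with image interaction, improving problem solving in scientific domains | reinforcement learning, multimodal, chain-of-thought | |
| 16 | Latent Denoising Improves Visual Alignment in Large Multimodal Models | Proposes a visual alignment method based on latent-space denoising that improves the performance of large multimodal models | distillation, multimodal | |
| 17 | WorldMark: A Unified Benchmark Suite for Interactive Video World Models | WorldMark: a unified benchmark for interactive video world models, enabling fair model comparison | world model, world models | |
| 18 | Seeing Fast and Slow: Learning the Flow of Time in Videos | Proposes a temporal-flow learning framework that enables temporally aware speed estimation, speed control, and super-resolution reconstruction for videos | world model, world models, multimodal | |
| 19 | VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection | Proposes VFM$^{4}$SDG, which uses vision foundation models to improve cross-domain stability in single-domain generalized object detection | representation learning, distillation, foundation model | |
| 20 | UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection | Proposes UAU-Net, which improves the robustness and reliability of facial action unit detection through uncertainty modeling | representation learning | |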

🔬 Pillar 1: Robot Control (4 papers)

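| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 21 | Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision | Proposes the EgoPoint-Bench benchmark to improve MLLMs' pointing-based referential understanding in egocentric vision | sim-to-real, egocentric, egocentric vision | |
| 22 | LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation | LatRef-Diff: a latent- and reference-guided diffusion model for facial attribute editing and style transfer | manipulation | |
| 23 | Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation | Proposes a diffusion-based framework to explore the role of synthetic data in controllable human-centric video generation | sim2real, embodied AI | |
| 24 | Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts | Proposes the SFAM framework, based on semantic fine-grained alignment and mixture-of-experts, to improve cross-domain generalization in face forgery detection | manipulation | |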

🔬 Pillar 8: Physics-based Animation (4 papers)

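| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 25 | Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting | Proposes Reshoot-Anything, addressing the scarcity of multi-view data for in-the-wild video reshooting | spatiotemporal | |
| 26 | Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers | Sculpt4D: generates high-quality 4D dynamic shapes with sparse-attention diffusion Transformers | spatiotemporal | |
| 27 | Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation | Proposes Sparse Forcing, which accelerates autoregressive diffusion video generation and improves long-horizon generation quality | spatiotemporal | |
| 28 | Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting | Proposes Reshoot-Anything, a self-supervised model for video reshooting in real-world scenes | spatiotemporal | |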

🔬 Pillar 7: Motion Retargeting (2 papers)

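| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 29 | Encoder-Free Human Motion Understanding via Structured Motion Descriptions | Proposes Structured Motion Descriptions (SMD), enabling human motion understanding without an encoder | human motion, large language model | |
| 30 | SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning | Proposes the SpatiO framework, which tackles spatial reasoning by orchestrating vision-language agents at test time | spatial relationship | |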

🔬 Pillar 6: Video Extraction & Matching (2 papers)

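| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 31 | OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction | OmniFit: multi-modal 3D body fitting via scale-agnostic dense landmark prediction | SMPL, SMPL-X | |
| 32 | EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms | EgoMAGIC: an egocentric medical video dataset for training perception algorithms | egocentric | |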
