cs.CV (2026-04-23)
📊 32 papers in total | 🔗 9 with code
🎯 Navigation by interest area
Pillar 3: Spatial Perception and Semantics (7 🔗3)
Pillar 9: Embodied Foundation Models (7 🔗3)
Pillar 2: RL Algorithms and Architecture (6 🔗1)
Pillar 1: Robot Control (4 🔗1)
Pillar 8: Physics-based Animation (4)
Pillar 7: Motion Retargeting (2 🔗1)
Pillar 6: Video Extraction and Matching (2)
🔬 Pillar 3: Spatial Perception and Semantics (7 papers)
🔬 Pillar 9: Embodied Foundation Models (7 papers)
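🔬 Pillar 2: RL Algorithms and Architecture (6 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images | S1-VL: a multimodal model that combines scientific reasoning with image interaction to improve problem solving in scientific domains. | reinforcement learning, multimodal, chain-of-thought | | |
| 16 | Latent Denoising Improves Visual Alignment in Large Multimodal Models | Proposes a latent-space denoising method for visual alignment that improves the performance of large multimodal models. | distillation, multimodal | ✅ | |
| 17 | WorldMark: A Unified Benchmark Suite for Interactive Video World Models | WorldMark: a unified evaluation benchmark for interactive video world models that enables fair model comparison. | world model, world models | | |
| 18 | Seeing Fast and Slow: Learning the Flow of Time in Videos | Proposes a framework for learning temporal flow that enables time-aware speed estimation, speed control, and super-resolution reconstruction in video. | world model, world models, multimodal | | |
| 19 | VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection | Proposes VFM$^{4}$SDG, which uses vision foundation models to improve cross-domain stability in single-domain generalized object detection. | representation learning, distillation, foundation model | | |
| 20 | UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection | Proposes UAU-Net, which uses uncertainty modeling to improve the robustness and reliability of facial action unit detection. | representation learning | | |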
🔬 Pillar 1: Robot Control (4 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 21 | Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision | Proposes the EgoPoint-Bench benchmark to improve the pointing-based referential understanding of MLLMs in egocentric vision. | sim-to-real, egocentric, egocentric vision | ✅ | |
| 22 | LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation | LatRef-Diff: a latent- and reference-guided diffusion model for facial attribute editing and style transfer. | manipulation | | |
| 23 | Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation | Proposes a diffusion-based framework to explore the role of synthetic data in controllable human video generation. | sim2real, embodied AI | | |
| 24 | Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts | Proposes the SFAM framework, based on semantic fine-grained alignment and mixture-of-experts, to improve cross-domain generalization in face forgery detection. | manipulation | | |
🔬 Pillar 8: Physics-based Animation (4 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 25 | Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting | Proposes Reshoot-Anything, which addresses the scarcity of multi-view data for in-the-wild video reshooting. | spatiotemporal | | |
| 26 | Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers | Sculpt4D: generates high-quality dynamic 4D shapes with sparse-attention diffusion Transformers. | spatiotemporal | | |
| 27 | Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation | Proposes Sparse Forcing, which accelerates autoregressive diffusion video generation and improves long-horizon generation quality. | spatiotemporal | | |
| 28 | Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting | Proposes Reshoot-Anything, a self-supervised model for reshooting video in real-world scenes. | spatiotemporal | | |
🔬 Pillar 7: Motion Retargeting (2 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 29 | Encoder-Free Human Motion Understanding via Structured Motion Descriptions | Proposes Structured Motion Descriptions (SMD), which enable human motion understanding without an encoder. | human motion, large language model | ✅ | |
| 30 | SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning | Proposes the SpatiO framework, which tackles spatial reasoning by orchestrating vision-language agents at test time. | spatial relationship | | |
🔬 Pillar 6: Video Extraction and Matching (2 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 31 | OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction | OmniFit: multimodal 3D body fitting via scale-agnostic dense landmark prediction. | SMPL, SMPL-X | | |
| 32 | EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms | EgoMAGIC: an egocentric medical video dataset for training perception algorithms. | egocentric | | |