cs.CV (2026-04-22)
📊 37 papers in total | 🔗 13 with code
🎯 Interest Area Navigation
Pillar 2: RL & Architecture (13 🔗5)
Pillar 9: Embodied Foundation Models (11 🔗3)
Pillar 3: Perception & Semantics (7 🔗3)
Pillar 1: Robot Control (4 🔗1)
Pillar 7: Motion Retargeting (1)
Pillar 8: Physics-based Animation (1 🔗1)
🔬 Pillar 2: RL & Architecture (13 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds | GSCompleter: a distillation-free plugin for metric-aware 3D Gaussian Splatting completion | distillation, 3D gaussian splatting, 3DGS | | |
| 2 | LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model | LLaDA2.0-Uni: a unified multimodal understanding and generation framework built on a diffusion large language model | distillation, large language model, foundation model | ✅ | |
| 3 | SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models | Proposes SSL-R1, which improves the visual understanding of multimodal large language models through self-supervised reinforcement post-training | reinforcement learning, reward design, large language model | ✅ | |
| 4 | CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs | CCTVBench: a contrastive-consistency traffic VideoQA benchmark for multimodal LLMs | world model, world models, multimodal | | |
| 5 | GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction | Proposes GeoRect4D for dynamic sparse-view 3D reconstruction | distillation, 3DGS, 3D reconstruction | | |
| 6 | Hybrid Latent Reasoning with Decoupled Policy Optimization | Proposes the HyLaR framework, which enables hybrid latent reasoning in multimodal large language models through decoupled policy optimization | reinforcement learning, large language model, multimodal | ✅ | |
| 7 | X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference | Proposes X-Cache to address caching efficiency in few-step autoregressive world model inference | reinforcement learning, world model, world models | | |
| 8 | Beyond ZOH: Advanced Discretization Strategies for Vision Mamba | Proposes advanced discretization strategies for Vision Mamba to improve temporal fidelity in dynamic visual environments | Mamba, SSM, state space model | | |
| 9 | UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval | Proposes UniCVR, a unified zero-shot composed visual retrieval framework covering image and video retrieval tasks | contrastive learning, large language model, multimodal | | |
| 10 | Semi-Supervised Flow Matching for Mosaiced and Panchromatic Fusion Imaging | Proposes a semi-supervised flow matching method for fusing mosaiced hyperspectral and panchromatic images | flow matching, HSI | | |
| 11 | MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation | Proposes MambaLiteUNet, which achieves robust skin lesion segmentation through cross-gated adaptive feature fusion | Mamba, state space model | ✅ | |
| 12 | Video-ToC: Video Tree-of-Cue Reasoning | Proposes Video-ToC, which strengthens the understanding of video large language models through tree-of-cue reasoning | reinforcement learning, large language model | ✅ | |
| 13 | LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel | LaplacianFormer: proposes a Laplacian-kernel-based linear attention mechanism that improves Transformer performance on high-resolution vision tasks | linear attention | | |
🔬 Pillar 9: Embodied Foundation Models (11 papers)
🔬 Pillar 3: Perception & Semantics (7 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 25 | LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image | LEXIS: reconstructs 3D human-object interaction from a monocular image using latent proximal interaction signatures | scene understanding, physically plausible, VQ-VAE | ✅ | |
| 26 | SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark | SurgCoT: builds a chain-of-thought benchmark for spatiotemporal reasoning in surgical videos to improve multimodal large language model performance | affordance, spatiotemporal, large language model | ✅ | |
| 27 | SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation | SpaCeFormer: fast proposal-free open-vocabulary 3D instance segmentation | open-vocabulary, open vocabulary, foundation model | | |
| 28 | MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation | MAPRPose: multi-object 6D pose estimation using mask-aware proposals and amodal refinement | 6D pose estimation | | |
| 29 | Image Generators are Generalist Vision Learners | Vision Banana: image generators become generalist vision learners through instruction tuning, reaching SOTA performance | depth estimation, metric depth, Depth Anything | | |
| 30 | FurnSet: Exploiting Repeats for 3D Scene Reconstruction | FurnSet: exploits repeated instances for single-view 3D scene reconstruction, improving reconstruction quality | scene reconstruction | | |
| 31 | Semantic-Fast-SAM: Efficient Semantic Segmenter | Proposes Semantic-Fast-SAM, which combines FastSAM with a semantic labeling pipeline for real-time, high-accuracy semantic segmentation | open-vocabulary, open vocabulary | ✅ | |
🔬 Pillar 1: Robot Control (4 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 32 | Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation | Proposes a stability-driven motion generation framework for object-guided human-human co-manipulation | manipulation, flow matching, affordance | ✅ | |
| 33 | DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation | DeVI: physically plausible dexterous human-object interaction via synthetic video imitation | manipulation, dexterous hand, dexterous manipulation | | |
| 34 | ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards | Proposes ProMMSearchAgent, a generalizable multimodal search agent trained with process-oriented rewards | sim-to-real, reinforcement learning, policy learning | | |
| 35 | Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation | Proposes a spatial-temporal coherent correlation learning algorithm for speech-preserving facial expression manipulation | manipulation | | |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 36 | HumanScore: Benchmarking Human Motions in Generated Videos | HumanScore: a systematic evaluation framework for assessing human motion quality in AI-generated videos | human motion | | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 37 | DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion | DynamicRad: a content-adaptive sparse attention acceleration method for long video diffusion | spatiotemporal | ✅ | |