cs.CV (2026-04-22)

📊 37 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (13 🔗5) · Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 3: Perception & Semantics (7 🔗3) · Pillar 1: Robot Control (4 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 8: Physics-based Animation (1 🔗1)

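🔬 Pillar 2: RL & Architecture (13 papers)

# | Title | Key point | Tags | 🔗
1 | GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds | GSCompleter: a distillation-free plugin for metric-aware 3D Gaussian splatting completion | distillation, 3D gaussian splatting, 3DGS
2 | LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model | LLaDA2.0-Uni: a unified multimodal understanding and generation framework built on a diffusion large language model | distillation, large language model, foundation model
3 | SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models | Proposes SSL-R1, which improves the visual understanding of multimodal large language models through self-supervised reinforcement post-training | reinforcement learning, reward design, large language model
4 | CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs | CCTVBench: a contrastive-consistency traffic video QA benchmark for multimodal LLMs | world model, world models, multimodal
5 | GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction | Proposes GeoRect4D to address dynamic sparse-view 3D reconstruction | distillation, 3DGS, 3D reconstruction
6 | Hybrid Latent Reasoning with Decoupled Policy Optimization | Proposes the HyLaR framework, which achieves hybrid latent reasoning in multimodal large language models via decoupled policy optimization | reinforcement learning, large language model, multimodal
7 | X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference | Proposes X-Cache to address caching efficiency in few-step autoregressive world model inference | reinforcement learning, world model, world models
8 | Beyond ZOH: Advanced Discretization Strategies for Vision Mamba | Proposes advanced discretization strategies for Vision Mamba to improve temporal fidelity in dynamic visual environments | Mamba, SSM, state space model
9 | UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval | Proposes UniCVR, a unified zero-shot composed visual retrieval framework covering image and video retrieval tasks | contrastive learning, large language model, multimodal
10 | Semi-Supervised Flow Matching for Mosaiced and Panchromatic Fusion Imaging | Proposes a semi-supervised flow matching method for fusing mosaiced hyperspectral and panchromatic images | flow matching, HSI
11 | MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation | Proposes MambaLiteUNet, which achieves robust skin lesion segmentation through cross-gated adaptive feature fusion | Mamba, state space model
12 | Video-ToC: Video Tree-of-Cue Reasoning | Proposes Video-ToC, which strengthens the understanding of video large language models through tree-of-cue reasoning | reinforcement learning, large language model
13 | LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel | LaplacianFormer: proposes a linear attention mechanism based on the Laplacian kernel that improves Transformer performance on high-resolution vision tasks | linear attention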

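🔬 Pillar 9: Embodied Foundation Models (11 papers)

# | Title | Key point | Tags | 🔗
14 | Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback | Proposes Render-in-the-Loop, which improves vector graphics generation quality through visual self-feedback | large language model, foundation model, multimodal
15 | The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm | Reveals the "cost of seeing" in vision-language models and proposes a framework for evaluating and improving trustworthy multimodal reasoning | multimodal
16 | From Scene to Object: Text-Guided Dual-Gaze Prediction | Proposes DualGaze-VLM for fine-grained, text-guided driver attention prediction in autonomous driving | large language model, multimodal
17 | Exploring Spatial Intelligence from a Generative Perspective | Proposes GSI-Bench to evaluate generative spatial intelligence | large language model, multimodal
18 | WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring | Proposes WildFireVQA, a large-scale radiometric thermal VQA benchmark for aerial wildfire monitoring | large language model, multimodal
19 | R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs | Proposes R-CoV, which alleviates object hallucinations in LVLMs through region-aware chain-of-verification | multimodal
20 | Evian: Towards Explainable Visual Instruction-tuning Data Auditing | Proposes the EVIAN framework for auditing visual instruction-tuning data | instruction following
21 | From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR | Proposes a two-stage structure decoding approach for optical music recognition of complex polyphonic scores | multimodal
22 | Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models | Proposes ScanVLA, which uses perception-enhanced vision-language models for object-referring-guided scanpath prediction | multimodal
23 | Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing | Proposes a task-aware edit localization framework to address over-editing in instruction-based image editing | instruction following
24 | IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory | Proposes IMPACT-CYCLE, a contract-based multi-agent system for claim-level supervisory correction of long-video semantic memory | multimodal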

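🔬 Pillar 3: Perception & Semantics (7 papers)

# | Title | Key point | Tags | 🔗
25 | LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image | LEXIS: uses latent proximal interaction signatures to reconstruct 3D human-object interaction from a monocular image | scene understanding, physically plausible, VQ-VAE
26 | SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark | SurgCoT: builds a chain-of-thought benchmark for spatiotemporal reasoning in surgical videos to improve multimodal large language model performance | affordance, spatiotemporal, large language model
27 | SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation | SpaCeFormer: fast, proposal-free open-vocabulary 3D instance segmentation | open-vocabulary, open vocabulary, foundation model
28 | MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation | MAPRPose: multi-object 6D pose estimation using mask-aware proposals and amodal refinement | 6D pose estimation
29 | Image Generators are Generalist Vision Learners | Vision Banana: image generators become generalist vision learners through instruction tuning, reaching SOTA performance | depth estimation, metric depth, Depth Anything
30 | FurnSet: Exploiting Repeats for 3D Scene Reconstruction | FurnSet: exploits repeated instances for single-view 3D scene reconstruction, improving reconstruction quality | scene reconstruction
31 | Semantic-Fast-SAM: Efficient Semantic Segmenter | Proposes Semantic-Fast-SAM, which combines FastSAM with a semantic labeling pipeline for real-time, high-accuracy semantic segmentation | open-vocabulary, open vocabulary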

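🔬 Pillar 1: Robot Control (4 papers)

# | Title | Key point | Tags | 🔗
32 | Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation | Proposes a stability-driven motion generation framework for object-guided human-human co-manipulation | manipulation, flow matching, affordance
33 | DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation | DeVI: physically plausible dexterous human-object interaction via synthetic video imitation | manipulation, dexterous hand, dexterous manipulation
34 | ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards | Proposes ProMMSearchAgent, a generalizable multimodal search agent trained with process-oriented rewards | sim-to-real, reinforcement learning, policy learning
35 | Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation | Proposes an algorithm for learning spatial-temporal coherent correlations, targeting speech-preserving facial expression manipulation | manipulation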

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | Key point | Tags | 🔗
36 | HumanScore: Benchmarking Human Motions in Generated Videos | HumanScore: a systematic evaluation framework for assessing human motion quality in AI-generated videos | human motion

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | Key point | Tags | 🔗
37 | DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion | DynamicRad: a content-adaptive sparse attention acceleration method for long video diffusion | spatiotemporal
