cs.CV (2026-04-22)

📊 37 papers in total | 🔗 13 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (13 🔗5) | Pillar 9: Embodied Foundation Models (11 🔗3) | Pillar 3: Spatial Perception & Semantics (7 🔗3) | Pillar 1: Robot Control (4 🔗1) | Pillar 7: Motion Retargeting (1) | Pillar 8: Physics-based Animation (1 🔗1)

🔬 Pillar 2: RL Algorithms & Architecture (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	GSCompleter: A Distillation-Free Plugin for Metric-Aware 3D Gaussian Splatting Completion in Seconds	GSCompleter: a distillation-free plugin for metric-aware 3D Gaussian Splatting completion	distillation 3D gaussian splatting 3DGS
2	LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model	LLaDA2.0-Uni: a unified multimodal understanding and generation framework built on a diffusion large language model	distillation large language model foundation model	✅
3	SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models	Proposes SSL-R1, which improves the visual understanding of multimodal large language models through self-supervised reinforcement post-training	reinforcement learning reward design large language model	✅
4	CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs	CCTVBench: a contrastive-consistency traffic VideoQA benchmark for multimodal LLMs	world model world models multimodal
5	GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction	Proposes GeoRect4D to address dynamic sparse-view 3D reconstruction	distillation 3DGS 3D reconstruction
6	Hybrid Latent Reasoning with Decoupled Policy Optimization	Proposes the HyLaR framework, which achieves hybrid latent reasoning in multimodal large language models through decoupled policy optimization	reinforcement learning large language model multimodal	✅
7	X-Cache: Cross-Chunk Block Caching for Few-Step Autoregressive World Models Inference	Proposes X-Cache to address caching efficiency in few-step autoregressive world model inference	reinforcement learning world model world models
8	Beyond ZOH: Advanced Discretization Strategies for Vision Mamba	Proposes advanced discretization strategies for Vision Mamba to improve temporal fidelity in dynamic visual environments	Mamba SSM state space model
9	UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval	Proposes UniCVR, a unified zero-shot composed visual retrieval framework covering image and video retrieval tasks	contrastive learning large language model multimodal
10	Semi-Supervised Flow Matching for Mosaiced and Panchromatic Fusion Imaging	Proposes a semi-supervised flow matching method for fusing mosaiced hyperspectral and panchromatic images	flow matching HSI
11	MambaLiteUNet: Cross-Gated Adaptive Feature Fusion for Robust Skin Lesion Segmentation	Proposes MambaLiteUNet, which achieves robust skin lesion segmentation through cross-gated adaptive feature fusion	Mamba state space model	✅
12	Video-ToC: Video Tree-of-Cue Reasoning	Proposes Video-ToC, which strengthens the understanding of video large language models through tree-of-cue reasoning	reinforcement learning large language model	✅
13	LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel	LaplacianFormer: proposes a Laplacian-kernel linear attention mechanism that improves Transformer performance on high-resolution vision tasks	linear attention

🔬 Pillar 9: Embodied Foundation Models (11 papers)

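#	Title	One-line summary	Tags	🔗	⭐
14	Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback	Proposes Render-in-the-Loop, which improves vector graphics generation quality through visual self-feedback	large language model foundation model multimodal
15	The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm	Reveals the "cost of seeing" in vision-language models and proposes a framework for evaluating and improving trustworthy multimodal reasoning	multimodal
16	From Scene to Object: Text-Guided Dual-Gaze Prediction	Proposes DualGaze-VLM for fine-grained, text-guided driver attention prediction in autonomous driving	large language model multimodal
17	Exploring Spatial Intelligence from a Generative Perspective	Proposes GSI-Bench to evaluate generative spatial intelligence	large language model multimodal
18	WildFireVQA: A Large-Scale Radiometric Thermal VQA Benchmark for Aerial Wildfire Monitoring	Proposes WildFireVQA, a large-scale radiometric thermal VQA benchmark for aerial wildfire monitoring	large language model multimodal	✅
19	R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs	Proposes R-CoV, which alleviates object hallucinations in LVLMs through region-aware chain-of-verification	multimodal	✅
20	Evian: Towards Explainable Visual Instruction-tuning Data Auditing	Proposes the EVIAN framework for auditing visual instruction-tuning data	instruction following
21	From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR	Proposes a two-stage structure decoding approach for optical music recognition of complex polyphonic scores	multimodal
22	Object Referring-Guided Scanpath Prediction with Perception-Enhanced Vision-Language Models	Proposes ScanVLA, which uses perception-enhanced vision-language models for object-referring-guided scanpath prediction	multimodal
23	Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing	Proposes a task-aware edit localization framework that addresses over-editing in instruction-based image editing	instruction following
24	IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory	Proposes IMPACT-CYCLE, which achieves claim-level supervisory correction of long-video semantic memory through a contract-based multi-agent system	multimodal	✅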

🔬 Pillar 3: Spatial Perception & Semantics (7 papers)

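#	Title	One-line summary	Tags	🔗	⭐
25	LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image	LEXIS: 3D human-object interaction reconstruction from a single image using latent proximal interaction signatures	scene understanding physically plausible VQ-VAE	✅
26	SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark	SurgCoT: builds a chain-of-thought benchmark for spatiotemporal reasoning in surgical videos to improve multimodal large language model performance	affordance spatiotemporal large language model	✅
27	SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation	SpaCeFormer: fast, proposal-free open-vocabulary 3D instance segmentation	open-vocabulary open vocabulary foundation model
28	MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation	MAPRPose: multi-object 6D pose estimation using mask-aware proposals and amodal refinement	6D pose estimation
29	Image Generators are Generalist Vision Learners	Vision Banana: image generators become generalist vision learners through instruction tuning, reaching SOTA performance	depth estimation metric depth Depth Anything
30	FurnSet: Exploiting Repeats for 3D Scene Reconstruction	FurnSet: exploits repeated instances for single-view 3D scene reconstruction, improving reconstruction quality	scene reconstruction
31	Semantic-Fast-SAM: Efficient Semantic Segmenter	Proposes Semantic-Fast-SAM, which combines FastSAM with a semantic labeling pipeline for real-time, high-accuracy semantic segmentation	open-vocabulary open vocabulary	✅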

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
32	Stability-Driven Motion Generation for Object-Guided Human-Human Co-Manipulation	Proposes a stability-driven motion generation framework for object-guided human-human co-manipulation	manipulation flow matching affordance	✅
33	DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation	DeVI: physically plausible dexterous human-object interaction via synthetic video imitation	manipulation dexterous hand dexterous manipulation
34	ProMMSearchAgent: A Generalizable Multimodal Search Agent Trained with Process-Oriented Rewards	Proposes ProMMSearchAgent, a generalizable multimodal search agent trained with process-oriented rewards	sim-to-real reinforcement learning policy learning
35	Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation	Proposes a spatial-temporal coherent correlation learning algorithm for speech-preserving facial expression manipulation	manipulation

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
36	HumanScore: Benchmarking Human Motions in Generated Videos	HumanScore: a systematic evaluation framework for assessing human motion quality in AI-generated videos	human motion

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
37	DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion	DynamicRad: a content-adaptive sparse attention acceleration method for long video diffusion	spatiotemporal	✅
