cs.CV (2026-04-16)

📊 29 papers in total | 🔗 6 with code

🎯 Topic navigation

Pillar 2: RL Algorithms & Architecture (9) · Pillar 9: Embodied Foundation Models (9 🔗3) · Pillar 3: Spatial Perception & Semantics (5 🔗2) · Pillar 1: Robot Control (3) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 6: Video Extraction & Matching (1)

🔬 Pillar 2: RL Algorithms & Architecture (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models	RaTA-Tool: a retrieval-based tool-selection framework for multimodal large language models	DPO, direct preference optimization, large language model
2	HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet	HAMSA: scanning-free vision state space models via SpectralPulseNet	Mamba, SSM, state space model
3	DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts	Proposes DETR-ViP, which improves open-vocabulary object detection by making visual prompts more discriminative	contrastive learning, VIP, distillation
4	Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography	Proposes LAMAE, a masked autoencoder with latent attention for multi-view echocardiography that improves cardiac representation learning	masked autoencoder, MAE, spatiotemporal
5	Integrating Object Detection, LiDAR-Enhanced Depth Estimation, and Segmentation Models for Railway Environments	Proposes an obstacle-detection framework for railway environments that combines object detection, LiDAR-enhanced depth estimation, and segmentation models	MAE, depth estimation, monocular depth
6	RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework	RAD-2: reinforcement learning in a generator-discriminator framework that improves the stability and safety of motion planning for autonomous driving	reinforcement learning, imitation learning, multimodal
7	Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models	Proposes the Visual-Switch knowledge distillation framework to address multimodal knowledge alignment in vision-language models	distillation, multimodal
8	LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories	LeapAlign: post-training alignment of flow matching models at any generation step by building two-step trajectories	flow matching
9	TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation	TurboTalk: a progressive distillation framework for one-step audio-driven talking avatar generation	distillation

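🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
10	MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation	Proposes MM-WebAgent, which uses hierarchical planning and self-reflection to address inconsistent style and poor global coherence in AIGC webpage generation	multimodal
11	Robustness of Vision Foundation Models to Common Perturbations	Systematically evaluates the robustness of vision foundation models to common perturbations and proposes improvements	foundation model
12	MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models	MapSR: prompt-driven land cover map super-resolution based on vision foundation models	foundation model	✅
13	G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval	Proposes G-MIXER, which tackles zero-shot composed image retrieval with geodesic mixup and semantic re-ranking	large language model, multimodal	✅
14	Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs	Proposes the Chain of Modality framework to address the performance bottleneck caused by static fusion in Omni-MLLMs	large language model, multimodal
15	ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling	ControlFoley: a unified, controllable video-to-audio generation framework that handles cross-modal conflicts	multimodal	✅
16	Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation	Proposes the Prompt-to-Gesture framework, which synthesizes gesture data with image-to-video generation models to ease data scarcity in gesture recognition	foundation model
17	From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation	Petro-SAM: a prompt-guided multi-task learning framework for petrographic thin-section segmentation	foundation model
18	Towards Design Compositing	Proposes GIST, which unifies the style of design elements and blends them seamlessly to improve design aesthetics	multimodal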

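🔬 Pillar 3: Spatial Perception & Semantics (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
19	NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation	NG-GS: NeRF-guided 3D Gaussian Splatting segmentation that addresses boundary discretization	3D gaussian splatting, 3DGS, gaussian splatting	✅
20	GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens	GlobalSplat: efficient feed-forward 3D Gaussian Splatting via global scene tokens	3D gaussian splatting, gaussian splatting, splatting	✅
21	One-shot Compositional 3D Head Avatars with Deformable Hair	Proposes a method for building compositional 3D head avatars with deformable hair from a single image, improving animation realism	3D gaussian splatting, 3DGS, gaussian splatting
22	TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens	TokenGS: decouples 3D Gaussian prediction from pixels, using learnable tokens for efficient scene reconstruction	3D gaussian splatting, 3DGS, gaussian splatting
23	Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting	Proposes a hybrid-latent Gaussian splatting method that improves geometric and appearance fidelity in multi-view reconstruction	splatting, NeRF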

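🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
24	R3D: Revisiting 3D Policy Learning	R3D: improves the stability and generalization of 3D policy learning through 3D data augmentation and an optimized network architecture	manipulation, policy learning, imitation learning
25	The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment	Proposes an image manipulation localization method based on adversarial evidence and reinforcement learning judgment, improving the robustness of tampered-region identification	manipulation, reinforcement learning
26	Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars	Proposes a single-image 3D head avatar reconstruction framework with explicit emotion control, enabling controllable emotion transfer	manipulation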

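🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
27	Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization	Proposes an unsupervised skeleton-based action segmentation method using hierarchical spatiotemporal vector quantization	spatiotemporal, TAMP
28	Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors	SEPatch3D: a dynamic patch-size adjustment framework for accelerating ViT-based sparse multi-view 3D object detection	spatiotemporal	✅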

🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
29	How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos	PIE-V framework: constructs and benchmarks mistake-aware egocentric procedural videos	egocentric
