cs.CV (2026-04-16)

📊 29 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (9) · Pillar 9: Embodied Foundation Models (9 🔗3) · Pillar 3: Perception & Semantics (5 🔗2) · Pillar 1: Robot Control (3) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 6: Video Extraction (1)

🔬 Pillar 2: RL & Architecture (9 papers)

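| # | Title | One-line summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 1 | RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models | RaTA-Tool: a retrieval-based tool selection framework for multimodal large language models | DPO, direct preference optimization, large language model | |
| 2 | HAMSA: Scanning-Free Vision State Space Models via SpectralPulseNet | HAMSA: scanning-free vision state space models via SpectralPulseNet | Mamba, SSM, state space model | |
| 3 | DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts | Proposes DETR-ViP, which improves open-vocabulary object detection by making visual prompts more discriminative | contrastive learning, VIP, distillation | |
| 4 | Beyond Independent Frames: Latent Attention Masked Autoencoders for Multi-View Echocardiography | Proposes LAMAE, a masked autoencoder with latent attention for multi-view echocardiography, improving cardiac representation learning | masked autoencoder, MAE, spatiotemporal | |
| 5 | Integrating Object Detection, LiDAR-Enhanced Depth Estimation, and Segmentation Models for Railway Environments | Proposes an obstacle detection framework for railway environments that combines object detection, LiDAR-enhanced depth estimation, and segmentation models | MAE, depth estimation, monocular depth | |
| 6 | RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework | RAD-2: a reinforcement learning method in a generator-discriminator framework that improves the stability and safety of motion planning for autonomous driving | reinforcement learning, imitation learning, multimodal | |
| 7 | Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models | Proposes the Visual-Switch knowledge distillation framework to address multimodal knowledge alignment in vision-language models | distillation, multimodal | |
| 8 | LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories | LeapAlign: post-training alignment of flow matching models at any generation step by building two-step trajectories | flow matching | |
| 9 | TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation | TurboTalk: a progressive distillation framework for one-step audio-driven talking avatar generation | distillation | |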

🔬 Pillar 9: Embodied Foundation Models (9 papers)

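| # | Title | One-line summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 10 | MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation | Proposes MM-WebAgent, which uses hierarchical planning and self-reflection to address inconsistent style and poor global coherence in AIGC webpage generation | multimodal | |
| 11 | Robustness of Vision Foundation Models to Common Perturbations | Systematically evaluates the robustness of vision foundation models to common perturbations and proposes improvements | foundation model | |
| 12 | MapSR: Prompt-Driven Land Cover Map Super-Resolution via Vision Foundation Models | MapSR: a prompt-driven land cover map super-resolution method based on vision foundation models | foundation model | |
| 13 | G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval | Proposes G-MIXER, which addresses zero-shot composed image retrieval through geodesic mixup and semantic re-ranking | large language model, multimodal | |
| 14 | Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs | Proposes the Chain of Modality framework to address the performance bottleneck caused by static fusion in Omni-MLLMs | large language model, multimodal | |
| 15 | ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling | ControlFoley: a unified, controllable video-to-audio generation framework that handles cross-modal conflicts | multimodal | |
| 16 | Prompt-to-Gesture: Measuring the Capabilities of Image-to-Video Deictic Gesture Generation | Proposes the Prompt-to-Gesture framework, which synthesizes gesture data with image-to-video generation models to ease data scarcity in gesture recognition | foundation model | |
| 17 | From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation | Petro-SAM: a prompt-guided multi-task learning framework for petrographic thin-section segmentation | foundation model | |
| 18 | Towards Design Compositing | Proposes GIST, which unifies the style of design elements and blends them seamlessly to improve design aesthetics | multimodal | |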

🔬 Pillar 3: Perception & Semantics (5 papers)

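| # | Title | One-line summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 19 | NG-GS: NeRF-Guided 3D Gaussian Splatting Segmentation | NG-GS: NeRF-guided 3D Gaussian Splatting segmentation that addresses boundary discretization | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 20 | GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens | GlobalSplat: efficient feed-forward 3D Gaussian Splatting via global scene tokens | 3D gaussian splatting, gaussian splatting, splatting | |
| 21 | One-shot Compositional 3D Head Avatars with Deformable Hair | Proposes a method for building compositional 3D head avatars with deformable hair from a single image, improving animation realism | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 22 | TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens | TokenGS: decouples 3D Gaussian prediction from pixels, using learnable tokens for efficient scene reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 23 | Hybrid Latents -- Geometry-Appearance-Aware Surfel Splatting | Proposes a hybrid-latent Gaussian splatting method that improves geometric and appearance fidelity in multi-view reconstruction | splatting, NeRF | |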

🔬 Pillar 1: Robot Control (3 papers)

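| # | Title | One-line summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 24 | R3D: Revisiting 3D Policy Learning | R3D: improves the stability and generalization of 3D policy learning by introducing 3D data augmentation and a refined network architecture | manipulation, policy learning, imitation learning | |
| 25 | The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment | Proposes an image manipulation localization method based on adversarial evidence and reinforcement-learning judgment, improving the robustness of tampered-region identification | manipulation, reinforcement learning | |
| 26 | Giving Faces Their Feelings Back: Explicit Emotion Control for Feedforward Single-Image 3D Head Avatars | Proposes a single-image 3D head avatar reconstruction framework with explicit emotion control, enabling controllable emotion transfer | manipulation | |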

🔬 Pillar 8: Physics-based Animation (2 papers)

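| # | Title | One-line summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 27 | Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization | Proposes an unsupervised skeleton-based action segmentation method based on hierarchical spatiotemporal vector quantization | spatiotemporal, TAMP | |
| 28 | Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors | SEPatch3D: a dynamic patch-size adjustment framework for accelerating ViT-based sparse multi-view 3D object detection | spatiotemporal | |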

🔬 Pillar 6: Video Extraction (1 paper)

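| # | Title | One-line summary | Tags | 🔗 |
|---|-------|------------------|------|----|
| 29 | How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos | PIE-V framework: constructs and benchmarks mistake-aware egocentric procedural videos | egocentric | |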
