cs.CV (2026-04-20)

📊 47 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗7) · Pillar 3: Perception & Semantics (13 🔗2) · Pillar 2: RL & Architecture (8 🔗4) · Pillar 1: Robot Control (4 🔗1) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (2) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

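# | Title | One-line summary | Tags | 🔗
1 | Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models | Proposes PDF, a test-time perturbation learning method based on delayed feedback that improves the robustness of VLA models under environmental changes. | vision-language-action VLA multimodal
2 | AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning | AeroRAG: a structured multimodal retrieval-augmented LLM for fine-grained aerial visual reasoning. | large language model multimodal
3 | Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models | Revisits the change VQA task on remote sensing imagery using structured and native multimodal Qwen models. | multimodal
4 | Mitigating Multimodal Hallucination via Phase-wise Self-reward | Proposes the PSRD framework, which mitigates multimodal hallucination in large vision-language models through a phase-wise self-reward mechanism. | multimodal
5 | DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection | DifFoundMAD: uses vision foundation models for efficient differential morphing attack detection on face images. | foundation model
6 | ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection | Proposes ZSG-IAD for zero-shot industrial anomaly detection with interpretable defect localization. | multimodal
7 | Prompting Foundation Models for Zero-Shot Ship Instance Segmentation in SAR Imagery | Prompts SAM2 with YOLOv11 detection boxes to achieve zero-shot ship instance segmentation in SAR imagery. | foundation model
8 | OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models | Proposes OneDrive, which uses vision-language-action models to unify multi-paradigm autonomous driving tasks. | vision-language-action
9 | EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations | Proposes the EVE framework to address the self-evolution of multimodal large language models. | large language model multimodal
10 | Weakly-Supervised Referring Video Object Segmentation through Text Supervision | Proposes WSRVOS, which achieves referring video object segmentation using text supervision only. | large language model multimodal
11 | Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation | Proposes the S-EGIU framework, which improves embodied navigation performance through dynamic instruction-perception entanglement. | VLN
12 | INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval | Proposes the INTENT network, which improves the robustness of composed image retrieval by decoupling modality noise. | multimodal
13 | HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval | Proposes the HABIT framework, which addresses noisy triplet correspondence in composed image retrieval and improves retrieval robustness. | multimodal
14 | From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models | HONES: task-aware neuron attribution and steering for multi-task vision-language models. | multimodal
15 | Source-Free Domain Adaptation with Vision-Language Prior | Proposes DIFO++, which uses vision-language priors for source-free domain adaptation. | multimodal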

🔬 Pillar 3: Perception & Semantics (13 papers)

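# | Title | One-line summary | Tags | 🔗
16 | E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes | E3VS-Bench: a benchmark for viewpoint-dependent active perception in 3D Gaussian Splatting scenes. | 3D gaussian splatting gaussian splatting splatting
17 | GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting | Proposes GS-STVSR, based on 2D Gaussian Splatting, for ultra-efficient continuous spatio-temporal video super-resolution. | gaussian splatting splatting optical flow
18 | Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection | Proposes the DFAlign framework, which uses a diffusion model to generate foreground knowledge and improves open-vocabulary temporal action detection. | open-vocabulary open vocabulary
19 | PCM-NeRF: Probabilistic Camera Modeling for Neural Radiance Fields under Pose Uncertainty | PCM-NeRF: a neural radiance field method based on a probabilistic camera model, designed for pose uncertainty. | NeRF neural radiance field
20 | Voronoi-guided Bilateral 2D Gaussian Splatting for Arbitrary-Scale Hyperspectral Image Super-Resolution | Proposes GaussianHSI, which uses Voronoi-guided bilateral Gaussian splatting for arbitrary-scale hyperspectral image super-resolution. | gaussian splatting splatting
21 | MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene | Proposes MU-GeNeRF, a generalizable neural radiance field guided by multi-view uncertainty, to handle distractors in scenes. | NeRF neural radiance field scene reconstruction
22 | Geometry-Guided 3D Visual Token Pruning for Video-Language Models | Proposes Geo3DPruner for geometry-guided 3D visual token pruning in efficient 3D vision-language models. | scene understanding large language model multimodal
23 | T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability | T-REN improves dense vision-language alignment and scalability through text-aligned region tokens. | open-vocabulary open vocabulary Ego4D
24 | AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis | Proposes an AI-based method for fully automatic spine segmentation and 3D reconstruction from MRI, for assessing paediatric scoliosis. | 3D reconstruction
25 | GeGS-PCR: Effective and Robust 3D Point Cloud Registration with Two-Stage Color-Enhanced Geometric-3DGS Fusion | Proposes GeGS-PCR, which fuses geometric, color, and Gaussian information to tackle registration of low-overlap and incomplete point clouds. | 3DGS
26 | Towards Symmetry-sensitive Pose Estimation: A Rotation Representation for Symmetric Object Classes | Proposes SARR, a symmetry-aware pose estimation method that resolves orientation ambiguity when estimating the pose of symmetric objects. | 6D pose estimation
27 | MEDN: Motion-Emotion Feature Decoupling Network for Micro-Expression Recognition | Proposes MEDN, a motion-emotion feature decoupling network for micro-expression recognition. | optical flow
28 | Score-Based Matching with Target Guidance for Cryo-EM Denoising | Proposes a cryo-EM image denoising method based on score matching and target guidance, improving structural consistency. | 3D reconstruction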

🔬 Pillar 2: RL & Architecture (8 papers)

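# | Title | One-line summary | Tags | 🔗
29 | XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments | XEmbodied: a foundation model for large-scale embodied environments with enhanced geometric and physical cues. | reinforcement learning occupancy grid affordance
30 | Sharpening Lightweight Models for Generalized Polyp Segmentation: A Boundary Guided Distillation from Foundation Models | LiteBounD: strengthens the generalization of lightweight polyp segmentation models through boundary-guided distillation. | distillation foundation model
31 | OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation | OneVL: one-step latent reasoning and planning with vision-language explanation, improving the efficiency of trajectory prediction for autonomous driving. | world model world models VLA
32 | Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors | TranCLR: models continuous skeleton action spaces with transitional anchors to improve action recognition accuracy. | contrastive learning human motion
33 | PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation | PlankFormer: robust plankton instance segmentation based on an MAE-pretrained ViT and pseudo community image generation. | masked autoencoder MAE
34 | S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models | Proposes S2H-DPO, which strengthens the global search and comparison abilities of vision-language models in multi-image reasoning. | DPO
35 | Soft Label Pruning and Quantization for Large-Scale Dataset Distillation | Proposes LPQLD, which significantly reduces the storage overhead of soft labels in large-scale dataset distillation while improving accuracy. | distillation
36 | CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition | Proposes CanonSLR to address viewpoint robustness in multi-view continuous sign language recognition. | teacher-student distillation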

🔬 Pillar 1: Robot Control (4 papers)

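# | Title | One-line summary | Tags | 🔗
37 | Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement | Re$^2$MoGen: open-vocabulary motion generation via LLM reasoning and physics-aware refinement. | motion planning reinforcement learning open-vocabulary
38 | SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy | SynAgent: achieves generalizable cooperative humanoid manipulation by transferring solo skills to multi-agent cooperation. | humanoid manipulation PPO
39 | A Comparative Evaluation of Geometric Accuracy in NeRF and Gaussian Splatting | Proposes a geometric-accuracy evaluation pipeline and compares NeRF and Gaussian Splatting in robotic manipulation scenarios. | manipulation gaussian splatting splatting
40 | MultiWorld: Scalable Multi-Agent Multi-View Video World Models | Proposes MultiWorld, a scalable multi-agent, multi-view video world model. | manipulation world model world models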

🔬 Pillar 8: Physics-based Animation (2 papers)

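# | Title | One-line summary | Tags | 🔗
41 | Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models | Reveals spatiotemporal sycophancy in video large language models, elicited by negation-based gaslighting attacks. | spatiotemporal large language model visual grounding
42 | Can LLM-Generated Text Empower Surgical Vision-Language Pre-training? | Proposes the SurgLIME framework, which uses LLM-generated text to enhance surgical vision-language pre-training and address the scarcity of expert-annotated data. | spatiotemporal large language model foundation model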

🔬 Pillar 7: Motion Retargeting (2 papers)

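# | Title | One-line summary | Tags | 🔗
43 | Advancing Vision Transformer with Enhanced Spatial Priors | Proposes EVT, a Vision Transformer whose spatial priors are enhanced using Euclidean distance. | spatial relationship
44 | Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation | Proposes MixTGFormer, a dual-stream spatio-temporal GCN-Transformer network that improves 3D human pose estimation accuracy. | spatial relationship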

🔬 Pillar 6: Video Extraction (2 papers)

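# | Title | One-line summary | Tags | 🔗
45 | LiquidTAD: An Efficient Method for Temporal Action Detection via Liquid Neural Dynamics | LiquidTAD: efficient temporal action detection using parallelized liquid neural dynamics. | Ego4D
46 | Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos | EgoInBetween: proposes the EgoIn framework for generating object state transition frames in egocentric videos. | egocentric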

🔬 Pillar 4: Generative Motion (1 paper)

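# | Title | One-line summary | Tags | 🔗
47 | AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion | AnyLift: uses 2D diffusion models to scale motion reconstruction from Internet videos, handling complex motion and human-object interaction. | motion diffusion model motion diffusion human-object interaction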
