cs.CV (2026-04-20)

📊 47 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗7) | Pillar 3: Perception & Semantics (13 🔗2) | Pillar 2: RL & Architecture (8 🔗4) | Pillar 1: Robot Control (4 🔗1) | Pillar 8: Physics-based Animation (2 🔗1) | Pillar 7: Motion Retargeting (2) | Pillar 6: Video Extraction (2) | Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

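#	Title	One-line summary	Tags	🔗	⭐
1	Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models	Proposes PDF, a test-time perturbation learning method based on delayed feedback that improves the robustness of VLA models under environmental changes.	vision-language-action VLA multimodal	✅
2	AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning	AeroRAG: a structured multimodal retrieval-augmented LLM for fine-grained aerial visual reasoning.	large language model multimodal
3	Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models	Revisits the change VQA task in remote sensing imagery using structured and native multimodal Qwen models.	multimodal
4	Mitigating Multimodal Hallucination via Phase-wise Self-reward	Proposes the PSRD framework, which mitigates multimodal hallucination in large vision-language models through a phase-wise self-reward mechanism.	multimodal
5	DifFoundMAD: Foundation Models meet Differential Morphing Attack Detection	DifFoundMAD: uses vision foundation models for efficient differential morphing attack detection on face images.	foundation model
6	ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection	Proposes ZSG-IAD for zero-shot industrial anomaly detection with interpretable defect localization.	multimodal
7	Prompting Foundation Models for Zero-Shot Ship Instance Segmentation in SAR Imagery	Prompts SAM2 with YOLOv11 detection boxes to achieve zero-shot ship instance segmentation in SAR imagery.	foundation model
8	OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models	Proposes OneDrive, which uses vision-language-action models to unify multi-paradigm autonomous driving tasks.	vision-language-action	✅
9	EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations	Proposes the EVE framework to address self-evolution of multimodal large language models.	large language model multimodal	✅
10	Weakly-Supervised Referring Video Object Segmentation through Text Supervision	Proposes WSRVOS, which achieves referring video object segmentation using only text supervision.	large language model multimodal	✅
11	Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation	Proposes the S-EGIU framework, which improves embodied navigation performance through dynamic instruction-perception entanglement.	VLN
12	INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval	Proposes the INTENT network, which improves the robustness of composed image retrieval by decoupling modality noise.	multimodal
13	HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval	Proposes the HABIT framework, which addresses noisy triplet correspondence in composed image retrieval and improves retrieval robustness.	multimodal	✅
14	From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models	HONES: task-aware neuron attribution and steering for multi-task vision-language models.	multimodal	✅
15	Source-Free Domain Adaptation with Vision-Language Prior	Proposes DIFO++, which uses vision-language priors for source-free domain adaptation.	multimodal	✅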

🔬 Pillar 3: Perception & Semantics (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
16	E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes	E3VS-Bench: a benchmark for viewpoint-dependent active perception in 3D Gaussian splatting scenes.	3D gaussian splatting gaussian splatting splatting
17	GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting	Proposes GS-STVSR, based on 2D Gaussian splatting, for ultra-efficient continuous spatio-temporal video super-resolution.	gaussian splatting splatting optical flow
18	Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection	Proposes the DFAlign framework, which uses diffusion models to generate foreground knowledge and improves open-vocabulary temporal action detection.	open-vocabulary open vocabulary
19	PCM-NeRF: Probabilistic Camera Modeling for Neural Radiance Fields under Pose Uncertainty	PCM-NeRF: a neural radiance field method based on a probabilistic camera model for handling pose uncertainty.	NeRF neural radiance field
20	Voronoi-guided Bilateral 2D Gaussian Splatting for Arbitrary-Scale Hyperspectral Image Super-Resolution	Proposes GaussianHSI, which uses Voronoi-guided bilateral Gaussian splatting for arbitrary-scale hyperspectral image super-resolution.	gaussian splatting splatting
21	MU-GeNeRF: Multi-view Uncertainty-guided Generalizable Neural Radiance Fields for Distractor-aware Scene	Proposes MU-GeNeRF, a generalizable neural radiance field guided by multi-view uncertainty, to handle distractors in scenes.	NeRF neural radiance field scene reconstruction
22	Geometry-Guided 3D Visual Token Pruning for Video-Language Models	Proposes Geo3DPruner for geometry-guided 3D visual token pruning in efficient 3D vision-language models.	scene understanding large language model multimodal
23	T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability	T-REN improves dense vision-language alignment and scalability through text-aligned region tokens.	open-vocabulary open vocabulary Ego4D	✅
24	AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis	Proposes an AI-based method for fully automatic spine segmentation and 3D reconstruction from MRI, for assessing paediatric scoliosis.	3D reconstruction
25	GeGS-PCR: Effective and Robust 3D Point Cloud Registration with Two-Stage Color-Enhanced Geometric-3DGS Fusion	Proposes GeGS-PCR, which fuses geometric, color, and Gaussian information to address low-overlap and incomplete point cloud registration.	3DGS
26	Towards Symmetry-sensitive Pose Estimation: A Rotation Representation for Symmetric Object Classes	Proposes SARR, a symmetry-aware pose estimation method that resolves orientation ambiguity in pose estimation of symmetric objects.	6D pose estimation	✅
27	MEDN: Motion-Emotion Feature Decoupling Network for Micro-Expression Recognition	Proposes MEDN, a motion-emotion feature decoupling network for micro-expression recognition.	optical flow
28	Score-Based Matching with Target Guidance for Cryo-EM Denoising	Proposes a cryo-EM image denoising method based on score matching and target guidance that improves structural consistency.	3D reconstruction

🔬 Pillar 2: RL & Architecture (8 papers)

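#	Title	One-line summary	Tags	🔗	⭐
29	XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments	XEmbodied: a foundation model for large-scale embodied environments with enhanced geometric and physical cues.	reinforcement learning occupancy grid affordance
30	Sharpening Lightweight Models for Generalized Polyp Segmentation: A Boundary Guided Distillation from Foundation Models	LiteBounD: strengthens the generalization of lightweight models in polyp segmentation through boundary-guided distillation.	distillation foundation model	✅
31	OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation	OneVL: one-step latent reasoning and planning with vision-language explanation, improving the efficiency of autonomous-driving trajectory prediction.	world model world models VLA	✅
32	Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors	TranCLR: models continuous skeleton action spaces with transitional anchors to improve action recognition accuracy.	contrastive learning human motion	✅
33	PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation	PlankFormer: robust plankton instance segmentation based on MAE-pretrained ViTs and pseudo community image generation.	masked autoencoder MAE
34	S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models	Proposes S2H-DPO, which strengthens the global search and comparison abilities of vision-language models in multi-image reasoning.	DPO
35	Soft Label Pruning and Quantization for Large-Scale Dataset Distillation	Proposes LPQLD, which substantially reduces the storage overhead of soft labels in large-scale dataset distillation and improves accuracy.	distillation	✅
36	CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition	Proposes CanonSLR to address viewpoint robustness in multi-view continuous sign language recognition.	teacher-student distillation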

🔬 Pillar 1: Robot Control (4 papers)

#	Title	One-line summary	Tags	🔗	⭐
37	Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement	Re$^2$MoGen: open-vocabulary motion generation via LLM reasoning and physics-aware refinement.	motion planning reinforcement learning open-vocabulary
38	SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent Synergy	SynAgent: generalizable cooperative humanoid manipulation via solo-to-cooperative skill transfer.	humanoid manipulation PPO	✅
39	A Comparative Evaluation of Geometric Accuracy in NeRF and Gaussian Splatting	Proposes a geometric accuracy evaluation pipeline that compares NeRF and Gaussian splatting in robot manipulation scenarios.	manipulation gaussian splatting splatting
40	MultiWorld: Scalable Multi-Agent Multi-View Video World Models	Proposes MultiWorld, a scalable multi-agent, multi-view video world model.	manipulation world model world models

🔬 Pillar 8: Physics-based Animation (2 papers)

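#	Title	One-line summary	Tags	🔗	⭐
41	Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models	Reveals spatiotemporal sycophancy in video large language models: a negation-based gaslighting attack.	spatiotemporal large language model visual grounding
42	Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?	Proposes the SurgLIME framework, which uses LLM-generated text to enhance surgical vision-language pre-training and address the scarcity of expert-annotated data.	spatiotemporal large language model foundation model	✅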

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
43	Advancing Vision Transformer with Enhanced Spatial Priors	Proposes EVT, a Vision Transformer that uses Euclidean distance to enhance spatial priors.	spatial relationship
44	Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation	Proposes MixTGFormer, a dual-stream spatio-temporal GCN-Transformer network that improves 3D human pose estimation accuracy.	spatial relationship

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
45	LiquidTAD: An Efficient Method for Temporal Action Detection via Liquid Neural Dynamics	LiquidTAD: efficient temporal action detection using parallelized liquid neural dynamics.	Ego4D
46	Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos	Ego-InBetween: proposes the EgoIn framework for generating object state transition frames in egocentric videos.	egocentric

🔬 Pillar 4: Generative Motion (1 paper)

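#	Title	One-line summary	Tags	🔗	⭐
47	AnyLift: Scaling Motion Reconstruction from Internet Videos via 2D Diffusion	AnyLift: scales motion reconstruction from internet videos using 2D diffusion models, handling complex motion and human-object interaction.	motion diffusion model motion diffusion human-object interaction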
