cs.CV (2026-05-20)

📊 52 papers in total | 🔗 15 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms and Architecture (RL & Architecture) (17 🔗2) · Pillar 9: Embodied Large Models (Embodied Foundation Models) (15 🔗5) · Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (10 🔗3) · Pillar 4: Generative Motion (4 🔗2) · Pillar 6: Video Extraction and Matching (Video Extraction) (2 🔗1) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 1: Robot Control (1)

🔬 Pillar 2: RL Algorithms and Architecture (RL & Architecture) (17 papers)

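# | Title | One-line summary | Tags | 🔗
1 | IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools | Proposes IndusAgent, which uses a tool-augmented agent to tackle open-vocabulary industrial anomaly detection. | reinforcement learning, open-vocabulary, open vocabulary |
2 | SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining | Proposes SpectralEarth-FM for joint pretraining on hyperspectral imagery and multimodal Earth observation data. | JEPA, Joint-Embedding Predictive Architecture, joint-embedding predictive architecture |
3 | Multimodal LLMs under Pairwise Modalities | Proposes a training framework for multimodal large language models based on modality pairs, improving cross-modal performance. | representation learning, contrastive learning, large language model |
4 | DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions | DriveMA: reshapes the language interface in driving VLAs with one-step meta-actions. | reinforcement learning, vision-language-action, VLA |
5 | Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models | Proposes Linear-DPO, which optimizes diffusion and flow-matching generative models via a linear utility function. | flow matching, DPO, direct preference optimization |
6 | 3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat | Proposes a multi-view image method for wheat spike volume estimation based on 3D reconstruction and knowledge distillation. | MAE, distillation, 3D reconstruction |
7 | VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026 | VISTA: a temporal predictor that integrates V-JEPA for Ego4D short-term object interaction anticipation. | JEPA, human-object interaction, egocentric |
8 | QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs | QwenSafe: performs multimodal content rating description identification using preference-aligned vision-language models. | DPO, direct preference optimization, multimodal |
9 | Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving | Proposes CoPhy, a cognitive-physical reinforcement learning framework that improves safety and intent understanding in autonomous driving. | reinforcement learning, imitation learning, world model |
10 | ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection | Proposes the ProCrit framework, which improves multimodal sarcasm detection through self-elicited multi-perspective reasoning and critic-guided revision. | reinforcement learning, multimodal |
11 | RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding | RCGDet3D: improves 4D radar-camera fusion for 3D object detection through enhanced radar feature encoding. | representation learning, gaussian splatting, splatting |
12 | Deformba: Vision State Space Model with Adaptive State Fusion | Deformba: a vision state space model based on adaptive state fusion that improves performance on vision tasks. | SSM, state space model |
13 | One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration | Proposes the Fixed-Point Distillation (FPD) framework for efficient one-step distillation of discrete diffusion image generators. | distillation |
14 | Latent Dynamics for Full Body Avatar Animation | Proposes a full-body avatar animation method based on a Transformer and dynamic residual latent variables, improving clothing detail and temporal coherence. | latent dynamics |
15 | GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection | GSA-YOLO: an efficient framework for X-ray security inspection built on structured sparsity and adaptive knowledge distillation. | distillation |
16 | Q-ARVD: Quantizing Autoregressive Video Diffusion Models | Q-ARVD: proposes a new quantization framework to accelerate inference of autoregressive video diffusion models. | world model, world models |
17 | JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026 | Proposes the JEPA-based JFAA method, which took first place in the EK-100 action anticipation challenge at EgoVis 2026. | JEPA, representation learning |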

🔬 Pillar 9: Embodied Large Models (Embodied Foundation Models) (15 papers)

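# | Title | One-line summary | Tags | 🔗
18 | Grounding Driving VLA via Inverse Kinematics | Strengthens the visual grounding of driving VLAs via inverse kinematics, improving trajectory prediction performance. | VLA, visual grounding |
19 | ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction | ProtoPathway: a biologically structured prototype-pathway fusion method for multimodal cancer survival prediction. | multimodal |
20 | Local-sensitive connectivity filter (ls-cf): A post-processing unsupervised improvement of the frangi, hessian and vesselness filters for multimodal vessel segmentation | Proposes the local-sensitive connectivity filter (LS-CF) for unsupervised post-processing of vessel segmentation, improving the performance of filters such as Frangi. | multimodal |
21 | FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition | FruitEnsemble: an MLLM-guided heterogeneous ensemble method for fine-grained fruit recognition. | large language model, multimodal, chain-of-thought |
22 | HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction | Proposes the HDMoE framework to address redundant information and fine-grained relation modeling in multimodal cancer survival prediction. | multimodal |
23 | SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction | SAVER: a selective, as-needed visual evidence method for multimodal information extraction. | multimodal |
24 | TextSculptor: Training and Benchmarking Scene Text Editing | TextSculptor: builds a scene text editing dataset and benchmark, improving the performance of open-source models. | large language model, multimodal |
25 | VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering | Proposes VIHD, which detects hallucinations of multimodal large language models in medical VQA through visual intervention. | large language model, multimodal |
26 | VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence | VISTAQA: a new benchmark for jointly evaluating visual question answering and pixel-level evidence. | large language model, multimodal |
27 | Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics | Proposes a Pareto-optimized portrait generation method that improves alignment, realism, and aesthetic quality through vision-aligned text supervision. | foundation model, multimodal |
28 | Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning | Proposes Uni-Edit: intelligent image editing as a general task for unified model tuning. | multimodal |
29 | RoadTones: Tone Controllable Text Generation from Road Event Videos | Proposes the RoadTones dataset and model for tone-controllable text generation from road event videos. | chain-of-thought |
30 | STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval | Proposes the STiTch framework to address the semantic gap and compositional diversity in training-free zero-shot composed image retrieval. | multimodal |
31 | VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment | VersusQ: a generalizable video quality assessment framework based on pairwise margin reasoning. | multimodal |
32 | RISE: Reliable Improvement in Self-Evolving Vision-Language Models | Proposes the RISE framework to improve the reliability and efficiency of self-evolving learning in vision-language models. | multimodal |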

🔬 Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (10 papers)

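# | Title | One-line summary | Tags | 🔗
33 | RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses | Proposes the RelWitness framework, which uses visual-geometric relation cues to address incomplete relation annotations in open-vocabulary 3D scene graph generation. | open-vocabulary, open vocabulary |
34 | AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting | Proposes AIR: a self-supervised feed-forward 2D Gaussian splatting framework for image reconstruction. | gaussian splatting, splatting |
35 | Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding | Proposes CRS, a graph-based visual reasoning framework that improves road understanding in autonomous driving scenes. | scene understanding, open-vocabulary, open vocabulary |
36 | Towards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation | Proposes orthogonal projected gradients and temporal regularization to achieve physically consistent 4D reconstruction of autonomous driving scenes. | 3DGS, scene reconstruction |
37 | HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction | HyDAR-Pano3D: a hybrid disentangled anatomical recovery framework for 3D reconstruction from panoramic X-rays. | 3D reconstruction |
38 | Stream3D: Sequential Multi-View 3D Generation via Evidential Memory | Proposes Stream3D, which uses evidential memory for sequential multi-view 3D generation from monocular video streams. | sam 3D, SAM 3D |
39 | FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching | Proposes FlowLong to address quality and consistency in long video generation. | 3DGS |
40 | ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models | Proposes ArchSIBench to evaluate the architectural spatial intelligence of vision-language models. | scene understanding |
41 | Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches | Proposes Sketch2MinSurf, which generates editable minimal surfaces under vision-language guidance. | NeRF |
42 | $Δ$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos | Proposes $Δ$YNAMICS, which uses a language-based representation to infer rigid-body dynamics from videos, improving generalization. | optical flow |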

🔬 Pillar 4: Generative Motion (4 papers)

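# | Title | One-line summary | Tags | 🔗
43 | DrawMotion: Generating 3D Human Motions by Freehand Drawing | DrawMotion: proposes a diffusion framework that generates 3D human motion from freehand sketches, improving user control and efficiency. | text-to-motion, motion generation, human motion |
44 | RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis | Proposes RePCM for region-specific, phenotype-adaptive bi-ventricular cardiac motion synthesis in cardiovascular disease. | motion synthesis |
45 | CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction | CHOIR: proposes a contact-aware 4D hand-object interaction reconstruction framework that extracts reusable interaction primitives from monocular video. | contact-aware, HOI |
46 | DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars | Proposes DAMA, which reconstructs controllable multi-layered clothed humans using disentangled body-anchored Gaussians. | physically plausible, penetration, SMPL |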

🔬 Pillar 6: Video Extraction and Matching (Video Extraction) (2 papers)

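# | Title | One-line summary | Tags | 🔗
47 | OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026 | Proposes OSGNet with MLLM reranking to address temporal localization in the Ego4D Episodic Memory Challenge. | egocentric, Ego4D, large language model |
48 | Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video | Proposes Map-Mono-Ego for map-grounded global human pose estimation from monocular egocentric video. | egocentric |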

🔬 Pillar 8: Physics-based Animation (2 papers)

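# | Title | One-line summary | Tags | 🔗
49 | Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning | Proposes the PREX framework, which achieves faithful 4D video editing through region-aware conditioning. | spatiotemporal |
50 | DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions | Proposes EIS-HAR to address event-camera human action recognition under low-light and shaking-camera conditions. | spatiotemporal |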

🔬 Pillar 7: Motion Retargeting (1 paper)

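# | Title | One-line summary | Tags | 🔗
51 | AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models | AttriStory: uses diffusion models to achieve fine-grained attribute control in visual storytelling. | latent optimization, large language model |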

🔬 Pillar 1: Robot Control (1 paper)

# | Title | One-line summary | Tags | 🔗
52 | Comparative Evaluation of Deep Learning Models for Fake Image Detection | Compares deep learning models for fake image detection; VGG16 achieves the highest accuracy. | manipulation |
