cs.CV (2026-05-18)

📊 75 papers in total | 🔗 19 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (23 🔗7) | Pillar 2: RL & Architecture (19 🔗6) | Pillar 3: Perception & Semantics (18 🔗3) | Pillar 1: Robot Control (8 🔗2) | Pillar 6: Video Extraction (3 🔗1) | Pillar 7: Motion Retargeting (2) | Pillar 8: Physics-based Animation (2)

🔬 Pillar 9: Embodied Foundation Models (23 papers)

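# | Title | One-line summary | Tags | 🔗
1 | Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models | Proposes Vision Inference Former (VIF) to address the weakening of visual information in multimodal large language models. | large language model multimodal
2 | StableVLA: Towards Robust Vision-Language-Action Models without Extra Data | StableVLA: improves the robustness of vision-language-action models under real-world visual perturbations without extra data. | vision-language-action VLA
3 | MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents | MementoGUI: a learnable agentic multimodal memory control framework for long-horizon GUI agents. | multimodal visual grounding
4 | RAVE: Re-Allocating Visual Attention in Large Multimodal Models | RAVE: improves large multimodal model performance by re-allocating visual attention. | multimodal visual grounding
5 | OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models | Proposes OmniSelect, dynamic modality-aware token compression for efficient OmniLLMs. | large language model multimodal
6 | Semantic Generative Tuning for Unified Multimodal Models | Proposes Semantic Generative Tuning (SGT), which uses image segmentation tasks to improve the understanding and generation abilities of unified multimodal models. | multimodal
7 | Lance: Unified Multimodal Modeling by Multi-Task Synergy | Lance: unified multimodal modeling through multi-task synergy, supporting image and video understanding, generation, and editing. | multimodal
8 | Speech-Guided Multimodal Learning for Vocal Tract Segmentation in Real-Time MRI | Proposes speech-guided multimodal learning for vocal tract segmentation in real-time MRI. | multimodal
9 | Cracks in the Foundation: A Civil Infrastructure Dataset to Challenge Vision Foundation Models | Releases CiF, a civil infrastructure crack dataset, revealing the limitations of vision foundation models in structural health monitoring. | foundation model
10 | SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning | SkyNative: a native multimodal framework for remote sensing visual evidence reasoning. | multimodal
11 | Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory | Proposes IAMFlow, a training-free identity-aware memory framework that improves consistency in narrative long video generation. | large language model multimodal
12 | CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark | Proposes CrossView Suite to improve the cross-view spatial intelligence of MLLMs, comprising a dataset, a model, and a benchmark. | large language model multimodal
13 | OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding | OmniPro: a comprehensive benchmark for omni-proactive streaming video understanding. | large language model multimodal
14 | A More Word-like Image Tokenization for MLLMs | Proposes DiVT, a decoupled visual tokenization method that improves MLLM image semantic understanding and reduces compute cost. | large language model multimodal
15 | SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents | SPIKE: an adaptive dual-controller framework for cost-efficient long-horizon game agents. | multimodal
16 | CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic | Proposes CATA, which achieves continual machine unlearning for vision-language models via conflict-averse task arithmetic. | multimodal
17 | What is Holding Back Latent Visual Reasoning? | Identifies the key factors holding back latent reasoning in visual reasoning models and proposes directions for improvement. | chain-of-thought
18 | Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos | Proposes MIGA, which enhances training-free infinite-frame generation for consistent long videos. | foundation model
19 | SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals | SGSoft: learns fused semantic-geometric features via template-guided soft signals for 3D shape correspondence. | multimodal
20 | What Matters for Grocery Product Retrieval with Open Source Vision Language Models | Proposes a systematic evaluation methodology to improve grocery product retrieval accuracy. | multimodal
21 | See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding | SWIM: aligns vision and language representations for fine-grained object understanding in video. | multimodal
22 | TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model | TinySAM 2: extreme memory compression for an efficient Track Anything model. | foundation model
23 | CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models | Proposes the CounterCount framework for diagnosing counting bias in vision-language models. | multimodal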

🔬 Pillar 2: RL & Architecture (19 papers)

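# | Title | One-line summary | Tags | 🔗
24 | PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis | PanoWorld: a generative spatial world model for synthesizing consistent whole-house panoramas. | world model world models 3D gaussian splatting
25 | Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation | Vision-OPD: improves the fine-grained visual understanding of multimodal LLMs via on-policy self-distillation. | distillation large language model multimodal
26 | CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook | CodeBind: multimodal alignment via decoupled representation learning and a unified compositional codebook. | representation learning large language model multimodal
27 | The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting | Proposes the MixCount dataset to address mixed-object counting. | MAE open-vocabulary open vocabulary
28 | Vision Foundation Models as Generalist Tokenizers for Image Generation | Proposes VFMTok, a generalist image tokenizer built on vision foundation models that significantly improves image generation quality and efficiency. | contrastive learning classifier-free guidance foundation model
29 | Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models | Incantation: proposes natural language as the action interface for multi-entity video world models. | world model world models distillation
30 | Xiaomi EV World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving | Xiaomi proposes JWM, a world model that integrates reconstruction and generation for autonomous driving. | world model world models distillation
31 | LatentUMM: Dual Latent Alignment for Unified Multimodal Models | Proposes LatentUMM, which improves cross-modal consistency in unified multimodal models via dual latent alignment. | latent dynamics multimodal
32 | Semi-LAR: Semi-supervised Contrastive Learning with Linear Attention for Removal of Nighttime Flares | Proposes Semi-LAR, a semi-supervised contrastive learning framework that effectively removes lens flares from nighttime images. | linear attention contrastive learning
33 | LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift | LESSViT: a robust hyperspectral representation learning method that addresses spectral configuration shift. | representation learning masked autoencoder HSI
34 | Leveraging Latent Visual Reasoning in Silence | Proposes an attention-reward-based latent visual reasoning method that improves performance on multimodal tasks. | reinforcement learning multimodal visual grounding
35 | HexagonalWarriorMamba: Superior Threshold-Dependent Multi-label Classification of 12-Lead ECG Cardiac Abnormalities | Proposes the HexagonalWarriorMamba model to improve multi-label classification of cardiac abnormalities from 12-lead ECG. | Mamba spatial relationship
36 | Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation | Proposes Patch-MoE Mamba to improve medical image segmentation performance. | Mamba state space model
37 | WavFlow: Audio Generation in Waveform Space | WavFlow: proposes a framework that generates audio directly in waveform space, with no intermediate representation. | flow matching multimodal
38 | TIGER-FG: Text-Guided Implicit Fine-Grained Grounding for E-commerce Retrieval | Proposes the TIGER-FG framework, which uses text-guided implicit fine-grained grounding to address modality and granularity gaps in e-commerce retrieval. | distillation multimodal
39 | WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens | WinTok: a win-win hybrid visual tokenizer that decouples visual understanding and generation. | distillation foundation model
40 | SAS: Semantic-aware Sampling for Generative Dataset Distillation | SAS: semantic-aware sampling for generative dataset distillation, enriching the semantic information of the distilled dataset. | distillation
41 | MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation | Proposes MoASE++, which addresses continual test-time adaptation through a mixture of activation sparsity experts and domain-adaptive on-policy distillation. | distillation
42 | SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training | SafeDiffusion-R1: proposes an online reward steering method for safe post-training of diffusion models, with no supervised data required. | reinforcement learning offline reinforcement learning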

🔬 Pillar 3: Perception & Semantics (18 papers)

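# | Title | One-line summary | Tags | 🔗
43 | GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance | Proposes GaussianZoom for high-fidelity 3D reconstruction from low-resolution inputs. | 3D gaussian splatting 3D reconstruction gaussian splatting
44 | 3D Skew Gaussian Splatting with Any Camera Trajectory Visualization Engine | Proposes 3D Skew Gaussian Splatting, which improves 3D scene rendering quality and structural compactness and supports visualization along any camera trajectory. | 3D gaussian splatting 3DGS gaussian splatting
45 | RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting | RT-Splatting: jointly models reflection and transmission with Gaussian splatting for high-quality real-time rendering of semi-transparent objects. | 3D gaussian splatting 3DGS gaussian splatting
46 | DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection | Proposes the DSAA framework, which strengthens fine-grained open-vocabulary object detection through dual-stage attribute activation. | open-vocabulary open vocabulary
47 | Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate | Proposes a multiview 3D consistency evaluation benchmark that reveals and mitigates hallucination in 3D foundation models. | VGGT foundation model
48 | GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation | GeoFlow: improves temporal stability in video generation through a geometric consistency reward. | optical flow geometric consistency
49 | PERL: Parameter Efficient Reasoning in CLIP Latent Space | Proposes PERL, which enables fast adaptation of vision-language models via parameter-efficient reasoning in CLIP latent space. | open-vocabulary open vocabulary multimodal
50 | NeRF-based Spacecraft Reconstruction from Close-Range Monocular Imagery Under Illumination Variability and Pose Uncertainty | Proposes a NeRF-based spacecraft reconstruction method that handles illumination variability and pose uncertainty. | 3D reconstruction NeRF neural radiance field
51 | Efficient Sparse-to-Dense Visual Localization via Compact Gaussian Scene Representation and Accelerated Dense Pose Estimation | LiteLoc: efficient sparse-to-dense visual localization based on a compact Gaussian scene representation and accelerated dense pose estimation. | 3D gaussian splatting 3DGS gaussian splatting
52 | SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning | SpatioRoute: dynamic prompt routing for zero-shot spatial reasoning. | affordance egocentric chain-of-thought
53 | Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation | Proposes a weakly supervised cross-modal learning framework to improve 4D radar scene flow estimation accuracy. | scene flow
54 | UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction | Proposes UAVFF3D, a geometry-aware benchmark for UAV 3D reconstruction, improving feed-forward network performance on UAV imagery. | 3D reconstruction
55 | Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding | Human-like eye-movement patterns emerge from a foveated visual language model that maximizes scene understanding. | scene understanding
56 | Efficient 3D Content Reconstruction and Generation | Proposes Instant3D and FastMap to accelerate 3D content generation and reconstruction, with applications in gaming, VR, and other areas. | 3D reconstruction foundation model
57 | Towards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework | Proposes the JMOF framework to address poor cross-model generalization in physical adversarial attacks. | depth estimation monocular depth
58 | PIXLRelight: Controllable Relighting via Intrinsic Conditioning | PIXLRelight: proposes a controllable single-image relighting method based on intrinsic conditioning. | 3D reconstruction
59 | CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery | Proposes the CATRF framework to address the bandwidth bottleneck in volumetric media delivery. | 3DGS
60 | Imaging Hidden Objects with Consumer LiDAR via Motion Induced Sampling | Proposes a multi-frame fusion strategy based on motion-induced sampling, enabling non-line-of-sight imaging with consumer LiDAR. | 3D reconstruction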

🔬 Pillar 1: Robot Control (8 papers)

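# | Title | One-line summary | Tags | 🔗
61 | StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video | StableHand: quality-aware flow matching for world-space dual-hand motion estimation from egocentric video. | bi-manual policy learning flow matching
62 | Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models | Proposes the SP-CoR framework to tackle dynamic spatial reasoning in multi-robot cooperation. | quadruped distillation egocentric
63 | EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation | EgoInteract: synthetic egocentric video generation for interaction understanding and anticipation. | manipulation human-object interaction egocentric
64 | InstructAV2AV: Instruction-Guided Audio-Video Joint Editing | Proposes InstructAV2AV for instruction-guided joint audio-video editing that preserves audio-visual consistency. | manipulation instruction following
65 | ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop | Proposes the ESI-Bench benchmark for evaluating the spatial intelligence of embodied agents in the closed perception-action loop. | locomotion manipulation
66 | Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning | Proposes the Biometric Identity Provisioning (BIP) framework, providing million-scale non-colliding virtual identities for digital entities. | humanoid humanoid robot
67 | AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents | AtlasVA: a self-evolving visual skill memory framework for teacher-free VLM agents. | manipulation reinforcement learning
68 | Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models | Exposes vulnerabilities in Arabic handwriting recognition models: a study of black-box adversarial attacks on embedded ConvNets. | manipulation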

🔬 Pillar 6: Video Extraction (3 papers)

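# | Title | One-line summary | Tags | 🔗
69 | DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos | DanceHMR: hand-aware whole-body human mesh recovery from monocular video. | human mesh recovery HMR SMPL
70 | EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos | Proposes the EgoExoMem benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. | egocentric
71 | MARS: Technical Report for the CASTLE Challenge at EgoVis 2026 | MARS: an EgoVis 2026 CASTLE Challenge solution based on multimodal agent reasoning and source selection. | egocentric multimodal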

🔬 Pillar 7: Motion Retargeting (2 papers)

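# | Title | One-line summary | Tags | 🔗
72 | Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis | Code-as-Room: generates 3D rooms from top-down view images via agentic code synthesis. | spatial relationship embodied AI
73 | Functionalization via Structure Completion and Motion Rectification | Proposes a graph-completion-based object functionalization method that performs structure completion and motion rectification for 3D models. | motion prediction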

🔬 Pillar 8: Physics-based Animation (2 papers)

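# | Title | One-line summary | Tags | 🔗
74 | UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation | UST-Hand: an uncertainty-aware spatiotemporal point cloud interaction network for self-supervised 3D hand pose estimation. | spatiotemporal
75 | Temporal Aware Pruning for Efficient Diffusion-based Video Generation | Proposes TAPE, a temporal-aware pruning method for efficient diffusion-based video generation. | spatiotemporal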
