cs.CV (2026-05-14)

📊 67 papers in total | 🔗 12 with code

🎯 Interest area navigation

Pillar 2: RL & Architecture (22 🔗6) · Pillar 9: Embodied Foundation Models (20 🔗2) · Pillar 1: Robot Control (9 🔗2) · Pillar 3: Perception & Semantics (9 🔗1) · Pillar 4: Generative Motion (2) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL & Architecture (22 papers)

# | Title | One-line summary | Tags
1 | Quantitative Video World Model Evaluation for Geometric-Consistency | Proposes PDI-Bench for quantitatively evaluating the geometric consistency of video generation models. | world model, world models, physically plausible
2 | EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding | Proposes the EARL framework to strengthen egocentric interaction reasoning and pixel-level grounding. | reinforcement learning, egocentric, egocentric vision
3 | EponaV2: Driving World Model with Comprehensive Future Reasoning | EponaV2: a driving world model with comprehensive future reasoning that improves autonomous-driving planning. | flow matching, world model, world models
4 | SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer | SANA-WM: an efficient minute-scale world model built on a hybrid linear diffusion Transformer. | world model, world models, linear attention
5 | FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery | FactorizedHMR: a hybrid framework for video human mesh recovery that improves robustness under occlusion and weak depth. | flow matching, classifier-free guidance, human mesh recovery
6 | Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke | Proposes a Vision-Core guided contrastive learning method for balanced multimodal stroke prognosis prediction. | contrastive learning, large language model, multimodal
7 | SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding | Proposes SceneParser for interaction-oriented hierarchical scene parsing to improve visual semantic understanding. | curriculum learning, scene understanding, open-vocabulary
8 | MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting | MambaRain: a multi-scale precipitation nowcasting framework that combines Mamba and attention. | Mamba, representation learning, spatiotemporal
9 | Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation | Proposes Causal Forcing++, a frame-level 2-step autoregressive diffusion distillation method that accelerates interactive video generation. | world model, world models, distillation
10 | Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation | Proposes semantic-prior learning based on hierarchical knowledge distillation to stabilize point-supervised infrared small target detection. | distillation, foundation model
11 | EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration | EverAnimate: minute-scale human animation generation via latent flow restoration. | flow matching, human motion
12 | SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition | SurgicalMamba: a dual-path SSD with state regramming for online surgical phase recognition. | Mamba, SSM
13 | Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning | Proposes the CLVR framework, which improves complex visual generation through closed-loop verified reasoning. | reinforcement learning, distillation, multimodal
14 | RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO | Proposes RAVEN to address insufficient quality in long video generation. | reinforcement learning, distillation
15 | Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation | Proposes Delta Forcing, which steers interactive autoregressive video generation with an adaptive trust region to improve temporal consistency. | world model, world models
16 | KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration | KVPO: ODE-native GRPO based on KV semantic exploration for autoregressive video alignment. | reinforcement learning, flow matching
17 | PanoWorld: Geometry-Consistent Panoramic Video World Modeling | PanoWorld: a geometry-consistent panoramic video world modeling method that generates realistic panoramic video from a single image and text. | world model, world models, geometric consistency
18 | Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction | Proposes IA-JEPA, a model with interaction-aware masking, to improve causal video prediction. | world model, world models, JEPA
19 | EgoExo-WM: Unlocking Exo Video for Ego World Models | EgoExo-WM: uses exocentric video to enhance egocentric world models. | world model, world models, egocentric
20 | ReactiveGWM: Steering NPC in Reactive Game World Models | Proposes ReactiveGWM for reactive game world modeling with controllable NPCs. | world model, world models
21 | Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models | Proposes Social-Mamba, which uses state-space models to forecast crowd trajectories efficiently and addresses the difficulty of modeling social interaction. | flow matching, Mamba, egocentric
22 | Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning | Proposes the CLVR framework, which improves complex visual generation through closed-loop verified reasoning. | reinforcement learning, distillation, multimodal

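🔬 Pillar 9: Embodied Foundation Models (20 papers)

# | Title | One-line summary | Tags
23 | Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture | Proposes TWN, an adaptive-reasoning multimodal embedding framework built on a dual-LoRA architecture that improves efficiency and quality. | large language model, multimodal, chain-of-thought
24 | MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models | Proposes the MEMLENS benchmark to systematically evaluate large vision-language models on multimodal long-term memory. | multimodal, visual grounding
25 | MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory | MemEye: a visual-centric evaluation framework for multimodal agent memory that addresses existing methods' neglect of fine-grained visual evidence. | multimodal
26 | Do Composed Image Retrieval Benchmarks Require Multimodal Composition? | Reveals unimodal shortcuts in composed image retrieval benchmarks and proposes a stricter evaluation method. | multimodal
27 | Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers | Proposes OSI, which markedly reduces concept omission in multimodal diffusion models by amplifying the omission signal. | multimodal
28 | Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis | Proposes the Entity-Rubrics framework and the AbstractEdit benchmark to evaluate how well image editing captures abstract intent. | multimodal, instruction following
29 | Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution | Proposes SIRA, an internal contrastive decoding framework that mitigates large-model hallucinations without external tools. | multimodal, visual grounding
30 | TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation | TOPOS: a high-fidelity, efficient, industry-grade 3D head generation framework that meets fixed-topology requirements. | large language model, multimodal
31 | Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models | For large diffusion vision-language models, proposes mask prior suppression and monotonic RoPE scaling to address repetitive generation and visual grounding degradation in long-text generation. | multimodal, visual grounding
32 | DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making | DermAgent: a self-reflective agentic system for dermatological image analysis with traceable decision-making. | large language model, multimodal
33 | Articraft: An Agentic System for Scalable Articulated 3D Asset Generation | Articraft: an agent-based system for scalable articulated 3D asset generation. | large language model
34 | On the Cultural Anachronism and Temporal Reasoning in Vision Language Models | Proposes the TAB-VLM benchmark, revealing cultural anachronism in VLMs' understanding of cultural heritage. | multimodal
35 | Characterizing the visual representation of objects from the child's view | Analyzes child's-view video to characterize object representations in early visual experience. | multimodal
36 | MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs | Proposes the MHSA framework, which mitigates hallucinations in large vision-language models via steered attention. | multimodal
37 | SteerSeg: Attention Steering for Reasoning Video Segmentation | SteerSeg: reasoning video segmentation via attention steering, improving the spatial grounding ability of LVLMs. | chain-of-thought
38 | Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study | Explores the zero-shot capability of vision-language models for online signature verification. | chain-of-thought
39 | MechVerse: Evaluating Physical Motion Consistency in Video Generation Models | MechVerse: a mechanical motion consistency benchmark for evaluating video generation models. | instruction following
40 | SceneForge: Structured World Supervision from 3D Interventions | SceneForge: a structured supervision framework for editable scenes based on 3D interventions. | multimodal
41 | ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest | ELDOR: a dataset and benchmark for monitoring illegal gold mining in the Amazon rainforest. | foundation model, multimodal
42 | Deep Pre-Alignment for VLMs | Proposes the Deep Pre-Alignment (DPA) architecture to address modality alignment in vision-language models. | large language model, multimodal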

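🔬 Pillar 1: Robot Control (9 papers)

# | Title | One-line summary | Tags
43 | Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos | HA-HOI: reconstructs physically plausible human-object interaction animations from monocular video. | humanoid, manipulation, real2sim
44 | Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model | Evo-Depth: a lightweight depth-enhanced vision-language-action model that improves spatial understanding for robot manipulation. | manipulation, spatial relationship, vision-language-action
45 | Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners | Proposes a self-adaptive interleaved visual reasoner that addresses the dual bottlenecks of unified multimodal models on Anything-to-Image tasks. | manipulation, multimodal
46 | PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation | Proposes PhyMotion, a physics-grounded structured 3D motion reward for improving the realism of human video generation. | humanoid, reinforcement learning, SMPL
47 | DriveCtrl: Conditioned Sim-to-Real Driving Video Generation | DriveCtrl: a depth-conditioned sim-to-real driving video generation framework that improves realism and downstream task performance. | sim-to-real, foundation model
48 | ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition | Proposes a concept-level machine unlearning framework to address knowledge removal in vision-language models. | manipulation, large language model, multimodal
49 | CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL | CreFlow: a corrective reflow method for sparse-reward embodied video diffusion RL. | manipulation, bi-manual, bimanual manipulation
50 | Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion | MIRAGE: uses conditional diffusion models to discover semantic attacks in online map construction, bypassing defenses and injecting false boundaries. | motion planning
51 | ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition | ICED: a concept-level machine unlearning method based on interpretable concept decomposition for vision-language models. | manipulation, large language model, multimodal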

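🔬 Pillar 3: Perception & Semantics (9 papers)

# | Title | One-line summary | Tags
52 | Efficient Dense Matching for Enhanced Gaussian Splatting Using AV1 Motion Vectors | Uses AV1 motion vectors to accelerate 3D Gaussian splatting, improving reconstruction quality and efficiency. | 3D gaussian splatting, 3DGS, gaussian splatting
53 | Denoising-GS: Gaussian Splatting with Spatial-aware Denoising | Denoising-GS: a Gaussian splatting method based on spatial-aware denoising. | 3D gaussian splatting, 3DGS, gaussian splatting
54 | 3D Skew-Normal Splatting | Proposes Skew-Normal Splatting, which uses a learnable skewness parameter to improve the compactness and accuracy of 3D Gaussian splatting when modeling asymmetric structures. | 3D gaussian splatting, 3DGS, gaussian splatting
55 | VGGT-Ω | VGGT-Ω: markedly improves reconstruction accuracy for static and dynamic scenes through large-scale training and an efficient architecture. | VGGT, vision-language-action
56 | VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction | VGGT-Edit: a feed-forward, native, text-driven 3D scene editing method based on residual field prediction. | scene reconstruction, VGGT
57 | CalibAnyView: Beyond Single-View Camera Calibration in the Wild | CalibAnyView: a multi-view camera self-calibration framework that improves geometric perception in the wild. | 3D reconstruction, geometric consistency
58 | Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach | Proposes PanoGSDet, which performs accurate monocular panoramic 3D object detection based on a semantic Gaussian representation. | depth estimation, scene understanding
59 | TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention | TurboVGGT: fast visual geometry reconstruction based on adaptive alternating attention. | 3D reconstruction
60 | 3D Skew-Normal Splatting | Proposes Skew-Normal Splatting, which uses a learnable skewness parameter to improve the reconstruction quality of 3D Gaussian splatting in scenes with asymmetric structures. | 3D gaussian splatting, 3DGS, gaussian splatting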

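🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-line summary | Tags
61 | Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control | Proposes MSCoT, a multi-scale coarse-to-fine model for test-time human motion control. | text-to-motion, motion synthesis, motion generation
62 | Multimodal Object Detection Under Sparse Forest-Canopy Occlusion | Proposes a multimodal object detection method for sparse forest-canopy occlusion to improve person detection in complex environments. | penetration, multimodal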

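🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line summary | Tags
63 | MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models | Proposes MultiEmo-Bench, a multi-label visual emotion analysis benchmark for evaluating the emotion understanding of multimodal large language models. | motion prediction, large language model, multimodal
64 | Analogical Trajectory Transfer | Proposes a training-free analogical trajectory transfer method that achieves semantically consistent motion trajectory transfer across scenes. | human-to-robot, foundation model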

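🔬 Pillar 6: Video Extraction (2 papers)

# | Title | One-line summary | Tags
65 | ViMU: Benchmarking Video Metaphorical Understanding | Proposes the ViMU benchmark for evaluating video metaphorical understanding, closing a gap in video semantic understanding. | HuMoR, multimodal
66 | Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding | Minerva-Ego: uses spatiotemporal hints to enhance egocentric video understanding. | egocentric, spatiotemporal, multimodal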

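🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags
67 | Local Spatiotemporal Convolutional Network for Robust Gait Recognition | Proposes LSTCN, a local spatiotemporal convolutional network, to address the difficulty of extracting motion patterns in gait recognition. | spatiotemporal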
