cs.CV (2026-05-14)

📊 67 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (22 🔗6) · Pillar 9: Embodied Foundation Models (20 🔗2) · Pillar 1: Robot Control (9 🔗2) · Pillar 3: Perception & Semantics (9 🔗1) · Pillar 4: Generative Motion (2) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 2: RL & Architecture (22 papers)

#	Title	Key point	Tags	🔗
1	Quantitative Video World Model Evaluation for Geometric-Consistency	Proposes PDI-Bench for quantitatively evaluating the geometric consistency of video generation models.	world model world models physically plausible	✅
2	EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding	Proposes the EARL framework to strengthen egocentric interaction reasoning and pixel-level grounding.	reinforcement learning egocentric egocentric vision
3	EponaV2: Driving World Model with Comprehensive Future Reasoning	EponaV2: a driving world model with comprehensive future reasoning that improves autonomous-driving planning.	flow matching world model world models
4	SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer	SANA-WM: an efficient minute-scale world model built on a hybrid linear diffusion Transformer.	world model world models linear attention
5	FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery	FactorizedHMR: a hybrid framework for video human mesh recovery that improves robustness under occlusion and weak depth.	flow matching classifier-free guidance human mesh recovery
6	Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke	Proposes a Vision-Core guided contrastive learning method for balanced multimodal stroke prognosis prediction.	contrastive learning large language model multimodal
7	SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding	Proposes SceneParser for interaction-oriented hierarchical scene parsing, improving visual semantic understanding.	curriculum learning scene understanding open-vocabulary
8	MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting	MambaRain: a multi-scale precipitation nowcasting framework that combines Mamba with attention.	Mamba representation learning spatiotemporal
9	Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation	Proposes Causal Forcing++, which achieves frame-level 2-step autoregressive diffusion distillation to accelerate interactive video generation.	world model world models distillation	✅
10	Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation	Proposes a semantic-prior learning method based on hierarchical knowledge distillation to stabilize point-supervised infrared small target detection.	distillation foundation model	✅
11	EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration	EverAnimate: minute-scale human animation generation via latent flow restoration.	flow matching human motion
12	SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition	SurgicalMamba: a dual-path SSD with state regramming for online surgical phase recognition.	Mamba SSM	✅
13	Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning	Proposes the CLVR framework, which improves complex visual generation through closed-loop verified reasoning.	reinforcement learning distillation multimodal
14	RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO	Proposes RAVEN to address insufficient quality in long video generation.	reinforcement learning distillation
15	Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation	Proposes Delta Forcing, which steers interactive autoregressive video generation with an adaptive trust region to improve temporal consistency.	world model world models
16	KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration	KVPO: ODE-native GRPO based on KV semantic exploration for autoregressive video alignment.	reinforcement learning flow matching
17	PanoWorld: Geometry-Consistent Panoramic Video World Modeling	PanoWorld: a geometry-consistent panoramic video world modeling method that generates realistic panoramic video from a single image and text.	world model world models geometric consistency	✅
18	Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction	Proposes IA-JEPA, a model with interaction-aware masking, to improve causal video prediction.	world model world models JEPA
19	EgoExo-WM: Unlocking Exo Video for Ego World Models	EgoExo-WM: uses exocentric video to enhance egocentric world models.	world model world models egocentric
20	ReactiveGWM: Steering NPC in Reactive Game World Models	Proposes ReactiveGWM, enabling reactive game world modeling with controllable NPCs.	world model world models
21	Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models	Proposes Social-Mamba, which uses state-space models to forecast crowd trajectories efficiently and addresses the difficulty of modeling social interactions.	flow matching Mamba egocentric	✅
22	Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning	Proposes the CLVR framework, which improves complex visual generation through closed-loop verified reasoning.	reinforcement learning distillation multimodal

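🔬 Pillar 9: Embodied Foundation Models (20 papers)

#	Title	Key point	Tags	🔗
23	Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture	Proposes TWN, an adaptive-reasoning multimodal embedding framework built on a dual-LoRA architecture, improving efficiency and quality.	large language model multimodal chain-of-thought
24	MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models	Proposes the MEMLENS benchmark to systematically evaluate large vision-language models on multimodal long-term memory.	multimodal visual grounding	✅
25	MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory	MemEye: a visual-centric evaluation framework for multimodal agent memory that addresses existing methods' neglect of fine-grained visual evidence.	multimodal
26	Do Composed Image Retrieval Benchmarks Require Multimodal Composition?	Reveals unimodal shortcuts in composed image retrieval benchmarks and proposes a stricter evaluation method.	multimodal
27	Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers	Proposes the OSI method, which significantly reduces concept omission in multimodal diffusion models by strengthening the omitted signals.	multimodal
28	Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis	Proposes the Entity-Rubrics framework and the AbstractEdit benchmark to evaluate understanding of abstract intent in image editing.	multimodal instruction following
29	Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution	Proposes SIRA, an internal contrastive decoding framework that mitigates large-model hallucinations without external tools.	multimodal visual grounding
30	TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation	TOPOS: a high-fidelity, efficient, industry-grade 3D head generation framework that meets fixed-topology requirements.	large language model multimodal
31	Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models	For large diffusion vision-language models, proposes mask prior suppression and monotonic RoPE scaling to address repetitive generation and visual grounding degradation in long-text generation.	multimodal visual grounding
32	DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making	DermAgent: a self-reflective agentic system for dermatological image analysis with traceable decision-making.	large language model multimodal	✅
33	Articraft: An Agentic System for Scalable Articulated 3D Asset Generation	Articraft: an agent-based system for scalable articulated 3D asset generation.	large language model
34	On the Cultural Anachronism and Temporal Reasoning in Vision Language Models	Proposes the TAB-VLM benchmark, revealing cultural anachronism errors in VLMs' understanding of cultural heritage.	multimodal
35	Characterizing the visual representation of objects from the child's view	Analyzes child's-view video to characterize object representations in early visual experience.	multimodal
36	MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs	Proposes the MHSA framework, which mitigates hallucinations in large vision-language models through steered attention.	multimodal
37	SteerSeg: Attention Steering for Reasoning Video Segmentation	SteerSeg: reasoning video segmentation via attention steering, improving the spatial localization ability of LVLMs.	chain-of-thought
38	Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study	Explores the zero-shot capability of vision-language models for online signature verification.	chain-of-thought
39	MechVerse: Evaluating Physical Motion Consistency in Video Generation Models	MechVerse: a mechanical motion consistency benchmark for evaluating video generation models.	instruction following
40	SceneForge: Structured World Supervision from 3D Interventions	SceneForge: a structured supervision framework for editable scenes based on 3D interventions.	multimodal
41	ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest	ELDOR: a dataset and benchmark for monitoring illegal gold mining in the Amazon rainforest.	foundation model multimodal
42	Deep Pre-Alignment for VLMs	Proposes the Deep Pre-Alignment (DPA) architecture to address the modality alignment problem in vision-language models.	large language model multimodal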

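🔬 Pillar 1: Robot Control (9 papers)

#	Title	Key point	Tags	🔗
43	Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos	HA-HOI: reconstructs physically plausible human-object interaction animation from monocular video.	humanoid manipulation real2sim	✅
44	Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model	Evo-Depth: a lightweight depth-enhanced vision-language-action model that improves spatial understanding for robot manipulation.	manipulation spatial relationship vision-language-action
45	Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners	Proposes a self-adaptive interleaved visual reasoner to resolve the dual bottlenecks of unified multimodal models on Anything-to-Image tasks.	manipulation multimodal	✅
46	PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation	Proposes PhyMotion, a physics-grounded structured 3D motion reward that improves the realism of human video generation.	humanoid reinforcement learning SMPL
47	DriveCtrl: Conditioned Sim-to-Real Driving Video Generation	DriveCtrl: a depth-conditioned Sim-to-Real driving video generation framework that improves realism and downstream task performance.	sim-to-real foundation model
48	ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition	Proposes a concept-level machine unlearning framework to address knowledge deletion in vision-language models.	manipulation large language model multimodal
49	CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL	CreFlow: a corrective reflow method for sparse-reward embodied video diffusion reinforcement learning.	manipulation bi-manual bimanual manipulation
50	Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion	MIRAGE: uses a conditional diffusion model to discover semantic attacks in online map construction, bypassing defenses and injecting false boundaries.	motion planning
51	ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition	ICED: a concept-level machine unlearning method based on interpretable concept decomposition for vision-language models.	manipulation large language model multimodal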

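🔬 Pillar 3: Perception & Semantics (9 papers)

#	Title	Key point	Tags	🔗
52	Efficient Dense Matching for Enhanced Gaussian Splatting Using AV1 Motion Vectors	Uses AV1 motion vectors to accelerate 3D Gaussian splatting, improving reconstruction quality and efficiency.	3D gaussian splatting 3DGS gaussian splatting
53	Denoising-GS: Gaussian Splatting with Spatial-aware Denoising	Denoising-GS: a Gaussian splatting method based on spatial-aware denoising.	3D gaussian splatting 3DGS gaussian splatting
54	3D Skew-Normal Splatting	Proposes Skew-Normal Splatting, whose learnable skewness parameter improves the compactness and accuracy of 3D Gaussian splatting when modeling asymmetric structures.	3D gaussian splatting 3DGS gaussian splatting
55	VGGT-Ω	VGGT-Ω: substantially improves reconstruction accuracy for static and dynamic scenes through large-scale training and an efficient architecture.	VGGT vision-language-action
56	VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction	VGGT-Edit: a feed-forward, native, text-driven 3D scene editing method based on residual field prediction.	scene reconstruction VGGT
57	CalibAnyView: Beyond Single-View Camera Calibration in the Wild	CalibAnyView: a multi-view camera self-calibration framework that improves geometric perception in the wild.	3D reconstruction geometric consistency
58	Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach	Proposes PanoGSDet, which achieves accurate monocular panoramic 3D object detection using a semantic Gaussian representation.	depth estimation scene understanding
59	TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention	TurboVGGT: fast visual geometry reconstruction based on adaptive alternating attention.	3D reconstruction	✅
60	3D Skew-Normal Splatting	Proposes Skew-Normal Splatting, whose learnable skewness parameter improves the reconstruction quality of 3D Gaussian splatting on scenes with asymmetric structures.	3D gaussian splatting 3DGS gaussian splatting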

🔬 Pillar 4: Generative Motion (2 papers)

#	Title	Key point	Tags	🔗
61	Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control	Proposes MSCoT, a multi-scale coarse-to-fine model for test-time human motion control.	text-to-motion motion synthesis motion generation
62	Multimodal Object Detection Under Sparse Forest-Canopy Occlusion	Proposes a multimodal object detection method for sparse forest-canopy occlusion, improving person detection in complex environments.	penetration multimodal

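🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	Key point	Tags	🔗
63	MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models	Proposes MultiEmo-Bench, a multi-label visual emotion analysis benchmark for evaluating the emotion understanding of multimodal large language models.	motion prediction large language model multimodal
64	Analogical Trajectory Transfer	Proposes a training-free analogical trajectory transfer method that achieves semantically consistent motion trajectory transfer across scenes.	human-to-robot foundation model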

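🔬 Pillar 6: Video Extraction (2 papers)

#	Title	Key point	Tags	🔗
65	ViMU: Benchmarking Video Metaphorical Understanding	Proposes the ViMU benchmark for evaluating video metaphorical understanding, narrowing the gap in video semantic understanding.	HuMoR multimodal
66	Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding	Minerva-Ego: uses spatiotemporal hints to enhance egocentric video understanding.	egocentric spatiotemporal multimodal	✅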

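🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	Key point	Tags	🔗
67	Local Spatiotemporal Convolutional Network for Robust Gait Recognition	Proposes LSTCN, a local spatiotemporal convolutional network, to address the difficulty of extracting motion patterns in gait recognition.	spatiotemporal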
