cs.CV (2026-05-07)

📊 28 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗1) Pillar 2: RL & Architecture (9 🔗2) Pillar 6: Video Extraction (3) Pillar 3: Perception & Semantics (3) Pillar 1: Robot Control (2 🔗1)

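🔬 Pillar 9: Embodied Foundation Models (11 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Bridging visual saliency and large language models for explainable deep learning in medical imaging	Proposes an explainable deep learning framework for medical imaging that combines visual saliency with large language models	large language model multimodal
2	Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study	MMDG-Bench: a comprehensive benchmark for multimodal domain generalization that reveals the limited generalization ability of existing methods	multimodal
3	Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment	Proposes the FuScore framework, which uses multimodal large language models for fine-grained quality assessment of infrared-visible image fusion	large language model multimodal
4	TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations	Proposes the TrajGANR framework, which achieves trajectory-centric urban multimodal learning through geospatially aligned neural representations	foundation model multimodal
5	From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation, Incident Response, and Accountability in Automated Vehicles	Proposes a modular ethical design framework to address the privacy, fairness, and accountability challenges of multimodal driver monitoring systems in automated driving	multimodal
6	Steering Visual Generation in Unified Multimodal Models with Understanding Supervision	Proposes the UNO framework, which uses understanding supervision to steer visual generation in unified multimodal models	multimodal
7	R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations	Proposes the R$^3$L framework, which tackles multi-hop spatial reasoning in 3D layout generation through invariant spatial decomposition and consistency-based imagination	large language model multimodal	✅
8	MedHorizon: Towards Long-context Medical Video Understanding in the Wild	Proposes the MedHorizon benchmark, targeting the evidence retrieval and reasoning challenges of long medical video understanding in real clinical settings	large language model multimodal
9	LensVLM: Selective Context Expansion for Compressed Visual Representation of Text	Proposes the LensVLM framework, which achieves efficient visual compression and understanding of text through selective context expansion	multimodal
10	A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency	Proposes the A$^2$RD architecture, which uses an agentic autoregressive diffusion model to address semantic drift and narrative collapse in long video generation	multimodal
11	VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding	Proposes the VideoRouter framework, which achieves efficient long-video understanding through a query-adaptive dual routing mechanism	multimodal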

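🔬 Pillar 2: RL & Architecture (9 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
12	EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields	Proposes EA-WM, an event-aware generative world model that addresses the alignment of precise control with visual perception in robotic manipulation	policy learning world model world models
13	Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models	On the choice of latent space for robotic world models, argues that semantically aligned representations outperform reconstruction-based ones	world model world models JEPA
14	Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement	Proposes NOVA, a weight-space world model with latent structural disentanglement for controllable video prediction	world model world models latent dynamics
15	Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling	Proposes a unified framework based on LLM-RL coupling that closes the loop between 3D scene generation and immersive interaction	reinforcement learning large language model	✅
16	DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency	DINORANKCLIP: vision-language pretraining via DINOv3 distillation and high-order ranking consistency	distillation
17	HumanNet: Scaling Human-centric Video Learning to One Million Hours	Proposes HumanNet, a large-scale human-centric video corpus that supports the training of embodied intelligence models with massive interaction data	representation learning motion generation human-object interaction
18	Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement	Proposes the NOVA world model framework, which achieves structural disentanglement and efficient video prediction through weight-space implicit neural representations	world model world models latent dynamics
19	VISD: Enhancing Video Reasoning via Structured Self-Distillation	Proposes VISD, a structured self-distillation framework that improves the reasoning ability and training efficiency of large video models through multi-dimensional diagnostic feedback	reinforcement learning distillation privileged information
20	Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation	Proposes the Heterogeneous Step Allocation (HSA) algorithm, which achieves efficient video generation by dynamically adjusting the number of denoising steps	flow matching spatiotemporal	✅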

🔬 Pillar 6: Video Extraction (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
21	NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps	NavOne: a one-step global planning method for vision-language navigation on top-down maps	egocentric VLN
22	MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware	MobileEgo Anywhere: an open platform for long-horizon egocentric data collection using mobile devices	egocentric vision-language-action VLA
23	NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps	Proposes the NavOne framework, which performs one-step global path planning for vision-language navigation on top-down maps	egocentric VLN

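🔬 Pillar 3: Perception & Semantics (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
24	AdpSplit: Error-Driven Adaptive Splitting for Faster Geometry Discovery in 3D Gaussian Splatting	Proposes AdpSplit, an error-driven adaptive splitting operator designed to accelerate geometry discovery in 3D Gaussian Splatting	3D gaussian splatting 3DGS gaussian splatting
25	OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects	Proposes the OneViewAll framework, which achieves single-view, model-free 6D object pose estimation guided by semantic priors	6D pose estimation
26	iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring	Proposes the iPhoneBlur benchmark, which evaluates motion deblurring on consumer devices through difficulty stratification	optical flow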

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
27	TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation	Proposes TriRelVLA, which uses a triadic relational structure to improve the generalization of embodied manipulation	manipulation vision-language-action VLA
28	BAMI: Training-Free Bias Mitigation in GUI Grounding	Proposes BAMI, a training-free approach to mitigating bias in GUI grounding	manipulation	✅
