cs.CV (2026-05-07)

📊 28 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗1) · Pillar 2: RL & Architecture (9 🔗2) · Pillar 6: Video Extraction (3) · Pillar 3: Perception & Semantics (3) · Pillar 1: Robot Control (2 🔗1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

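# | Title | One-sentence summary | Tags | 🔗
1 | Bridging visual saliency and large language models for explainable deep learning in medical imaging | Proposes an explainable deep learning framework for medical imaging that combines visual saliency with large language models | large language model, multimodal
2 | Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study | MMDG-Bench: a comprehensive benchmark for multimodal domain generalization that reveals the limited generalization of existing methods | multimodal
3 | Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment | Proposes the FuScore framework, which uses multimodal large language models for fine-grained quality assessment of infrared-visible image fusion | large language model, multimodal
4 | TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations | Proposes the TrajGANR framework for trajectory-centric urban multimodal learning via geospatially aligned neural representations | foundation model, multimodal
5 | From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation, Incident Response, and Accountability in Automated Vehicles | Proposes a modular ethical design framework to address the privacy, fairness, and accountability challenges of multimodal driver monitoring systems in automated driving | multimodal
6 | Steering Visual Generation in Unified Multimodal Models with Understanding Supervision | Proposes the UNO framework, which uses understanding supervision to steer visual generation in unified multimodal models | multimodal
7 | R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations | Proposes the R$^3$L framework, which addresses multi-hop spatial reasoning in 3D layout generation through invariant spatial decomposition and consistent imagination | large language model, multimodal
8 | MedHorizon: Towards Long-context Medical Video Understanding in the Wild | Proposes the MedHorizon benchmark, targeting evidence retrieval and reasoning in long-form medical video understanding in real clinical settings | large language model, multimodal
9 | LensVLM: Selective Context Expansion for Compressed Visual Representation of Text | Proposes the LensVLM framework, which uses selective context expansion for efficient visual compression and understanding of text | multimodal
10 | A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency | Proposes the A$^2$RD architecture, an agentic autoregressive diffusion model that addresses semantic drift and narrative collapse in long video generation | multimodal
11 | VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding | Proposes the VideoRouter framework, which uses query-adaptive dual routing for efficient long-video understanding | multimodal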

🔬 Pillar 2: RL & Architecture (9 papers)

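# | Title | One-sentence summary | Tags | 🔗
12 | EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields | Proposes EA-WM, an event-aware generative world model that addresses the alignment of precise control and visual perception in robotic manipulation | policy learning, world model, world models
13 | Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models | Examines the choice of latent space for robotic world models and argues that semantically aligned representations outperform reconstruction-based ones | world model, world models, JEPA
14 | Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement | Proposes NOVA, a weight-space world model with latent structural disentanglement for controllable video prediction | world model, world models, latent dynamics
15 | Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling | Proposes a unified framework based on LLM-RL coupling that closes the loop between 3D scene generation and immersive interaction | reinforcement learning, large language model
16 | DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency | DINORANKCLIP: vision-language pretraining via DINOv3 distillation and high-order ranking consistency | distillation
17 | HumanNet: Scaling Human-centric Video Learning to One Million Hours | Proposes HumanNet, a large-scale human-centric video corpus that supports training of embodied intelligence models with massive interaction data | representation learning, motion generation, human-object interaction
18 | Render, Don't Decode: Weight-Space World Models with Latent Structural Disentanglement | Proposes the NOVA world model framework, which uses weight-space implicit neural representations for structural disentanglement and efficient video prediction | world model, world models, latent dynamics
19 | VISD: Enhancing Video Reasoning via Structured Self-Distillation | Proposes VISD, a structured self-distillation framework that uses multi-dimensional diagnostic feedback to improve the reasoning ability and training efficiency of video large models | reinforcement learning, distillation, privileged information
20 | Not All Tokens Need 40 Steps: Heterogeneous Step Allocation in Diffusion Transformers for Efficient Video Generation | Proposes the Heterogeneous Step Allocation (HSA) algorithm, which dynamically adjusts denoising steps for efficient video generation | flow matching, spatiotemporal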

🔬 Pillar 6: Video Extraction (3 papers)

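# | Title | One-sentence summary | Tags | 🔗
21 | NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps | NavOne: a one-step global planning method for vision-language navigation on top-down maps | egocentric, VLN
22 | MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware | MobileEgo Anywhere: an open platform for long-horizon egocentric data collection using mobile devices | egocentric, vision-language-action, VLA
23 | NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps | Proposes the NavOne framework, which performs one-step global path planning for vision-language navigation using top-down maps | egocentric, VLN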

🔬 Pillar 3: Perception & Semantics (3 papers)

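# | Title | One-sentence summary | Tags | 🔗
24 | AdpSplit: Error-Driven Adaptive Splitting for Faster Geometry Discovery in 3D Gaussian Splatting | Proposes AdpSplit, an error-driven adaptive splitting operator designed to accelerate geometry discovery in 3D Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting
25 | OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects | Proposes the OneViewAll framework, which uses semantic prior guidance for single-view, model-free 6D object pose estimation | 6D pose estimation
26 | iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring | Proposes the iPhoneBlur benchmark, which evaluates motion deblurring on consumer devices through difficulty stratification | optical flow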

🔬 Pillar 1: Robot Control (2 papers)

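# | Title | One-sentence summary | Tags | 🔗
27 | TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation | Proposes TriRelVLA, which uses a triadic relational structure to improve the generalization of embodied manipulation | manipulation, vision-language-action, VLA
28 | BAMI: Training-Free Bias Mitigation in GUI Grounding | Proposes BAMI, a training-free approach to mitigating bias in GUI grounding | manipulation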
