cs.CV (2026-01-16)
📊 12 papers in total | 🔗 1 with code
🎯 Interest Area Navigation
Pillar 2: RL & Architecture (5 🔗1)
Pillar 9: Embodied Foundation Models (5)
Pillar 1: Robot Control (1)
Pillar 3: Perception & Semantics (1)
🔬 Pillar 2: RL & Architecture (5 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Generative Scenario Rollouts for End-to-End Autonomous Driving | Proposes the GeRo framework, which improves end-to-end autonomous driving performance through generative scenario rollouts. | reinforcement learning, imitation learning, vision-language-action | | |
| 2 | MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement | MMedExpert-R1: strengthens multimodal medical reasoning through domain-specific adaptation and clinical guideline reinforcement. | reinforcement learning, multimodal | | |
| 3 | FTDMamba: Frequency-Assisted Temporal Dilation Mamba for Unmanned Aerial Vehicle Video Anomaly Detection | Proposes FTDMamba to address UAV video anomaly detection under dynamic backgrounds. | Mamba, spatiotemporal | ✅ | |
| 4 | PhysRVG: Physics-Aware Unified Reinforcement Learning for Video Generative Models | Proposes PhysRVG, which improves the realism of rigid-body motion in video generative models through physics-aware reinforcement learning. | reinforcement learning | | |
| 5 | SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention | Proposes SoLA-Vision, a fine-grained layer-wise linear-softmax hybrid attention vision model that improves efficiency and accuracy in high-resolution image processing. | linear attention, representation learning | | |
🔬 Pillar 9: Embodied Foundation Models (5 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 6 | Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning | Proposes VIGA, a vision-as-inverse-graphics agent that uses interleaved multimodal reasoning for scene reconstruction and editing. | multimodal | | |
| 7 | MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models | Proposes MHA2MLA-VLM, which brings DeepSeek's economical Multi-Head Latent Attention to vision-language models. | multimodal | | |
| 8 | Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps | Proposes the Map2Thought framework, which enables explicit spatial reasoning in 3D vision-language models through metric cognitive maps. | chain-of-thought | | |
| 9 | Think-Clip-Sample: Slow-Fast Frame Selection for Video Understanding | Proposes Think-Clip-Sample to address frame selection in long-video understanding. | large language model | | |
| 10 | SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2 | SAMannot: a memory-efficient, local framework for interactive video instance segmentation based on SAM2. | foundation model | | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | X-Distill: Cross-Architecture Vision Distillation for Visuomotor Learning | X-Distill: cross-architecture visual knowledge distillation for robot visuomotor learning. | manipulation, diffusion policy, distillation | | |
🔬 Pillar 3: Perception & Semantics (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | IDDR-NGP: Incorporating Detectors for Distractor Removal with Instant Neural Radiance Field | IDDR-NGP: a distractor removal method for Instant-NGP scenes that incorporates detectors. | neural radiance field | | |