cs.CV(2026-05-15)
📊 40 papers in total | 🔗 9 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (16 🔗2)
Pillar 3: Perception & Semantics (11 🔗3)
Pillar 2: RL & Architecture (6 🔗2)
Pillar 1: Robot Control (3)
Pillar 4: Generative Motion (3 🔗2)
Pillar 8: Physics-based Animation (1)
🔬 Pillar 9: Embodied Foundation Models (16 papers)
🔬 Pillar 3: Perception & Semantics (11 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting | Proposes a prior-guided segmentation method that enables editable 3D Gaussian Splatting | 3D gaussian splatting, gaussian splatting, splatting | | |
| 18 | EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting | EndoGSim: physics-aware 4D dynamic endoscopic scene simulation via MLLM-guided Gaussian Splatting | depth estimation, gaussian splatting, splatting | | |
| 19 | Unlocking Dense Metric Depth Estimation in VLMs | Proposes DepthVLM, which turns vision-language models into native dense depth predictors and improves 3D spatial reasoning | depth estimation, metric depth, foundation model | | |
| 20 | Learn2Splat: Extending the Horizon of Learned 3DGS Optimization | Learn2Splat: extends the optimization horizon of 3D Gaussian Splatting through meta-learning | 3D gaussian splatting, 3DGS, gaussian splatting | ✅ | |
| 21 | 3D Segmentation Using Viewpoint-Dependent Spatial Relationships | Proposes a viewpoint-dependent 3D referring segmentation dataset and a viewpoint-aware model to improve spatial relationship understanding | scene understanding, spatial relationship, multimodal | | |
| 22 | Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation | Proposes a decomposed vision-language alignment framework for fine-grained open-vocabulary segmentation | open-vocabulary, open vocabulary | | |
| 23 | Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning | Proposes a self-prompting diffusion Transformer that performs open-vocabulary scene text editing via in-context learning | open-vocabulary, open vocabulary | ✅ | |
| 24 | RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations | Proposes RaPD, a resolution-agnostic pixel diffusion model built on semantics-enriched implicit representations | implicit representation | | |
| 25 | GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction | GHOST: a geometry-hierarchical online streaming token eviction method for efficient 3D reconstruction | 3D reconstruction | ✅ | |
| 26 | Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer | Proposes Fisher-Guided Quantization (FGQ) to address differing per-task quantization sensitivity in visual geometry Transformers | depth estimation, 3D reconstruction, VGGT | | |
| 27 | On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry | Proposes an RGB-TIR stereo calibration framework that addresses calibration under extreme resolution asymmetry | depth estimation, multimodal | | |
🔬 Pillar 2: RL & Architecture (6 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
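| 28 | Latent Video Prediction Learns Better World Models | Uses latent-space video prediction to improve the robustness of video world models | world model, world models, JEPA | | |
| 29 | ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark | Introduces ChronoEarth-492K, a large-scale spatiotemporal hyperspectral dataset and benchmark, to advance self-supervised learning on long hyperspectral time series | representation learning, HSI, spatiotemporal | | |
| 30 | DiLA: Disentangled Latent Action World Models | Proposes DiLA to address the trade-off between abstraction and generation quality in latent action models | world model, world models, optical flow | | |
| 31 | 3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds | Proposes 3DTMDet, which combines a Transformer and an SSM to address sparse distant points and context understanding in point cloud object detection | Mamba, SSM, state space model | ✅ | |
| 32 | Pretraining Objective Matters in Extreme Low-Data FGVC: A Backbone-Controlled Study | Studies how the pretraining objective affects representation quality in extreme low-data fine-grained classification | MAE, contrastive learning, distillation | | |
| 33 | From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding | Proposes the Group-Revision optimization paradigm to address sparse rewards on hard cases in object-level grounding | reinforcement learning, reward shaping | ✅ | |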
🔬 Pillar 1: Robot Control (3 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 34 | Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation | Proposes VLA-AD to address the efficiency of VLA policy distillation | manipulation, distillation, vision-language-action | | |
| 35 | UAM: A Dual-Stream Perspective on Forgetting in VLA Training | Proposes UAM, a dual-stream architecture that addresses the forgetting of multimodal capabilities during VLA training | manipulation, VLA, multimodal | | |
| 36 | WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes | WorldAct: converts static 3D worlds into interactive, object-centric scenes | manipulation, world model, world models | | |
🔬 Pillar 4: Generative Motion (3 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 37 | AnyAct: Towards Human Reenactment of Character Motion From Video | AnyAct: proposes a method for retargeting character videos to human performance | motion generation, motion retargeting, human motion | | |
| 38 | Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling | Proposes conditional Multi-view Ancestral Sampling (cMAS) for unsupervised single-view 3D human pose estimation | motion diffusion model, MDM, motion diffusion | ✅ | |
| 39 | VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation | Proposes VAGS, a velocity-adaptive guidance scale method for improving image editing and generation quality | classifier-free guidance | ✅ | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 40 | VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation | VideoSeeker: incentivizes instance-level video understanding through native agentic tool invocation | spatiotemporal | | |