cs.CV (2025-04-20)
📊 16 papers in total | 🔗 2 with code
🎯 Interest Area Navigation
Pillar 3: Spatial Perception and Semantics (6 🔗1)
Pillar 9: Embodied Foundation Models (6 🔗1)
Pillar 2: RL Algorithms and Architecture (3)
Pillar 1: Robot Control (1)
🔬 Pillar 3: Spatial Perception and Semantics (6 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation | Proposes NVSMask3D, which uses camera pose interpolation and hard visual prompting for 3D open-vocabulary instance segmentation | 3D gaussian splatting gaussian splatting splatting | | |
| 2 | VGNC: Reducing the Overfitting of Sparse-view 3DGS via Validation-guided Gaussian Number Control | VGNC: reduces overfitting in sparse-view 3DGS through validation-guided control of the number of Gaussians | 3D gaussian splatting 3DGS gaussian splatting | | |
| 3 | Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding | Proposes DVBench, a comprehensive benchmark for evaluating how well vision-language models understand safety-critical driving scenarios | scene understanding large language model multimodal | ✅ | |
| 4 | Metamon-GS: Enhancing Representability with Variance-Guided Densification and Light Encoding | Metamon-GS: enhances the representational capacity of 3D Gaussians through variance-guided densification and light encoding | 3D gaussian splatting 3DGS gaussian splatting | | |
| 5 | Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction | BA-Track: combines bundle adjustment with 3D tracking for accurate reconstruction of dynamic scenes | scene reconstruction | | |
| 6 | Seurat: From Moving Points to Depth | Seurat: infers depth changes in monocular video from the trajectories of moving points | depth estimation spatial relationship | | |
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens | Proposes a generative multimodal pretraining method based on discrete diffusion timestep tokens, improving multimodal understanding and generation | large language model multimodal | ✅ | |
| 8 | Video-MMLU: A Massive Multi-Discipline Lecture Understanding Benchmark | Proposes Video-MMLU for evaluating LMMs on multi-discipline lecture understanding | large language model multimodal | | |
| 9 | LSP-ST: Ladder Shape-Biased Side-Tuning for Robust Infrared Small Target Detection | Proposes LSP-ST, which achieves robust infrared small target detection through ladder shape-biased side-tuning | foundation model | | |
| 10 | LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation | LGD: uses generative descriptions to improve region-text matching in zero-shot referring image segmentation | large language model | | |
| 11 | ResNetVLLM -- Multi-modal Vision LLM for the Video Understanding Task | Proposes ResNetVLLM, a multimodal vision-language model for zero-shot video understanding | large language model | | |
| 12 | ResNetVLLM-2: Addressing ResNetVLLM's Multi-Modal Hallucinations | ResNetVLLM-2: mitigates multimodal hallucinations in ResNetVLLM through faithfulness detection and RAG | large language model | | |
🔬 Pillar 2: RL Algorithms and Architecture (3 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | Relation-R1: Progressively Cognitive Chain-of-Thought Guided Reinforcement Learning for Unified Relation Comprehension | Proposes Relation-R1, which uses reinforcement learning guided by cognitive chain-of-thought to address relation comprehension in a unified way | reinforcement learning large language model chain-of-thought | | |
| 14 | FlowLoss: Dynamic Flow-Conditioned Loss Strategy for Video Diffusion Models | FlowLoss: a dynamic optical-flow-conditioned loss strategy for video diffusion models that improves temporal consistency | flow matching optical flow | | |
| 15 | Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline | Proposes a closed-loop simulator and a causal benchmark to address the "copycat" problem of imitation-learning planners | reinforcement learning imitation learning | | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | VM-BHINet: Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image | Proposes VM-BHINet, which uses Vision Mamba to recover 3D interacting hand meshes from a single RGB image | bi-manual Mamba SSM | | |