cs.CV (2025-03-18)
📊 23 papers in total | 🔗 2 with code
🎯 Interest Area Navigation
Pillar 3: Spatial Perception & Semantics (6)
Pillar 9: Embodied Foundation Models (6)
Pillar 2: RL Algorithms & Architecture (5)
Pillar 1: Robot Control (2 🔗1)
Pillar 4: Generative Motion (1)
Pillar 8: Physics-based Animation (1 🔗1)
Pillar 6: Video Extraction & Matching (1)
Pillar 7: Motion Retargeting (1)
🔬 Pillar 3: Spatial Perception & Semantics (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Zero-Shot Scene Understanding with Multimodal Large Language Models for Automated Vehicles | Uses multimodal large language models for zero-shot scene understanding to improve decision-making in automated vehicles | scene understanding, large language model, multimodal | | |
| 2 | Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting | Proposes a Fuse-and-Refine module that improves the efficiency and quality of feed-forward 3D Gaussian Splatting for static and dynamic scene reconstruction | 3D gaussian splatting, gaussian splatting, splatting | | |
| 3 | HandSCS: Structural Coordinate Space for Animatable Hand Gaussian Splatting | HandSCS: a structural coordinate space for animatable hand Gaussian Splatting | 3D gaussian splatting, gaussian splatting, splatting | | |
| 4 | Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking | Uses vision-language models for open-vocabulary instance segmentation and tracking | open-vocabulary, open vocabulary | | |
| 5 | SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting | SketchSplat: a 3D edge reconstruction method based on differentiable multi-view sketch splatting | splatting | | |
| 6 | These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models | Proposes an uncertainty quantification method for radiance fields based on higher-order moments, improving downstream task performance | neural radiance field, scene understanding | | |
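🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives | A survey of the challenges and future directions of image captioning evaluation in the era of multimodal LLMs | large language model, multimodal | | |
| 8 | The Power of Context: How Multimodality Improves Image Super-Resolution | Proposes a multimodal-guided diffusion model that improves visual quality and detail fidelity in image super-resolution | multimodal | | |
| 9 | FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks | FlexVLN: a method that adapts flexibly to diverse vision-and-language navigation tasks | VLN, large language model | | |
| 10 | MusicInfuser: Making Video Diffusion Listen and Dance | MusicInfuser: enables video diffusion models to "listen" to music and generate dance videos | multimodal | | |
| 11 | MP-GUI: Modality Perception with MLLMs for GUI Understanding | Proposes MP-GUI, which uses multimodal large language models to improve GUI understanding | large language model | | |
| 12 | Growing a Twig to Accelerate Large Vision-Language Models | Proposes TwigVLM, which accelerates large vision-language models by growing a lightweight branch, improving inference speed and accuracy | multimodal | | |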
🔬 Pillar 2: RL Algorithms & Architecture (5 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | Advancing Medical Representation Learning Through High-Quality Data | Proposes Open-PMC, a high-quality medical image-text dataset that improves multimodal medical representation learning | representation learning, multimodal | | |
| 14 | DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers | DUNE: distills a universal encoder from heterogeneous 2D and 3D teacher models | distillation, depth estimation, foundation model | | |
| 15 | State Space Model Meets Transformer: A New Paradigm for 3D Object Detection | Proposes DEST, a new paradigm for 3D object detection based on interactive state space models, which significantly improves performance | SSM, state space model | | |
| 16 | VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation | VEGGIE: an instruction-based video editing framework that unifies concept editing, grounding, and reasoning | curriculum learning, multimodal | | |
| 17 | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | DualToken: unifies visual understanding and generation with dual visual vocabularies | contrastive learning, large language model | | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 18 | Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control | Cosmos-Transfer: a conditional world generation model with adaptive multimodal control, applied to Sim2Real | sim2real, multimodal | ✅ | |
| 19 | ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints | ShapeShift: a text-driven shape arrangement synthesis method based on content-aware geometric constraints | manipulation, distillation, spatial relationship | | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 20 | SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing | SALAD: a skeleton-aware latent diffusion model for text-driven motion generation and editing | text-driven motion, motion generation | | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 21 | Tracking Meets Large Multimodal Models for Driving Scenario Understanding | Proposes a large multimodal model that incorporates tracking information to improve driving scenario understanding | spatiotemporal, multimodal | ✅ | |
🔬 Pillar 6: Video Extraction & Matching (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 22 | Validation of Human Pose Estimation and Human Mesh Recovery for Extracting Clinically Relevant Motion Data from Videos | Validates human pose estimation and human mesh recovery techniques for extracting clinically relevant motion data from videos | human mesh recovery | | |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 23 | Comp-Attn: Present-and-Align Attention for Compositional Video Generation | Proposes Comp-Attn, a Present-and-Align attention mechanism that addresses subject presence and relation alignment in compositional video generation | spatial relationship, latent optimization | | |