cs.CV (2025-03-18)

📊 23 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (6) · Pillar 9: Embodied Foundation Models (6) · Pillar 2: RL & Architecture (5) · Pillar 1: Robot Control (2 🔗1) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1 🔗1) · Pillar 6: Video Extraction (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Perception & Semantics (6 papers)

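| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Zero-Shot Scene Understanding with Multimodal Large Language Models for Automated Vehicles | Uses multimodal large language models for zero-shot scene understanding to improve decision-making in automated vehicles. | scene understanding, large language model, multimodal | |
| 2 | Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting | Proposes a Fuse-and-Refine module that improves the efficiency and quality of feed-forward 3D Gaussian splatting for static and dynamic scene reconstruction. | 3D gaussian splatting, gaussian splatting, splatting | |
| 3 | HandSCS: Structural Coordinate Space for Animatable Hand Gaussian Splatting | HandSCS: a structural coordinate space for animatable hand Gaussian splatting. | 3D gaussian splatting, gaussian splatting, splatting | |
| 4 | Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking | Uses vision-language models for open-vocabulary instance segmentation and tracking. | open-vocabulary, open vocabulary | |
| 5 | SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting | SketchSplat: a 3D edge reconstruction method based on differentiable multi-view sketch splatting. | splatting | |
| 6 | These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models | Proposes an uncertainty quantification method for radiance fields based on higher-order moments, improving downstream task performance. | neural radiance field, scene understanding | |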

🔬 Pillar 9: Embodied Foundation Models (6 papers)

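| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 7 | Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives | A survey of the challenges and future directions of image captioning evaluation in the era of multimodal LLMs. | large language model, multimodal | |
| 8 | The Power of Context: How Multimodality Improves Image Super-Resolution | Proposes a multimodality-guided diffusion model that improves visual quality and detail fidelity in image super-resolution. | multimodal | |
| 9 | FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks | FlexVLN: a method that adapts flexibly to diverse vision-and-language navigation tasks. | VLN, large language model | |
| 10 | MusicInfuser: Making Video Diffusion Listen and Dance | MusicInfuser: enables video diffusion models to "listen" to music and generate dance videos. | multimodal | |
| 11 | MP-GUI: Modality Perception with MLLMs for GUI Understanding | Proposes MP-GUI, which uses multimodal large language models to improve GUI understanding. | large language model | |
| 12 | Growing a Twig to Accelerate Large Vision-Language Models | Proposes TwigVLM, which accelerates large vision-language models by growing a lightweight branch, improving inference speed and accuracy. | multimodal | |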

🔬 Pillar 2: RL & Architecture (5 papers)

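| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | Advancing Medical Representation Learning Through High-Quality Data | Proposes Open-PMC, a high-quality medical image-text dataset that improves multimodal medical representation learning. | representation learning, multimodal | |
| 14 | DUNE: Distilling a Universal Encoder from Heterogeneous 2D and 3D Teachers | DUNE: distills a universal encoder from heterogeneous 2D and 3D teacher models. | distillation, depth estimation, foundation model | |
| 15 | State Space Model Meets Transformer: A New Paradigm for 3D Object Detection | Proposes DEST, a new paradigm for 3D object detection based on interactive state space models, significantly improving performance. | SSM, state space model | |
| 16 | VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation | VEGGIE: an instruction-based video editing framework that unifies concept editing, grounding, and reasoning. | curriculum learning, multimodal | |
| 17 | DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies | DualToken: unifies visual understanding and generation using dual visual vocabularies. | contrastive learning, large language model | |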

🔬 Pillar 1: Robot Control (2 papers)

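| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 18 | Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control | Cosmos-Transfer: a conditional world generation model with adaptive multimodal control, applied to Sim2Real. | sim2real, multimodal | |
| 19 | ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints | ShapeShift: a text-driven shape arrangement synthesis method based on content-aware geometric constraints. | manipulation, distillation, spatial relationship | |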

🔬 Pillar 4: Generative Motion (1 paper)

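| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 20 | SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing | SALAD: a skeleton-aware latent diffusion model for text-driven motion generation and editing. | text-driven motion, motion generation | |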

🔬 Pillar 8: Physics-based Animation (1 paper)

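| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | Tracking Meets Large Multimodal Models for Driving Scenario Understanding | Proposes a large multimodal model that incorporates tracking information to improve driving scenario understanding. | spatiotemporal, multimodal | |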

🔬 Pillar 6: Video Extraction (1 paper)

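| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 22 | Validation of Human Pose Estimation and Human Mesh Recovery for Extracting Clinically Relevant Motion Data from Videos | Validates human pose estimation and human mesh recovery for extracting clinically relevant motion data from videos. | human mesh recovery | |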

🔬 Pillar 7: Motion Retargeting (1 paper)

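| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 23 | Comp-Attn: Present-and-Align Attention for Compositional Video Generation | Proposes Comp-Attn, which uses a Present-and-Align attention mechanism to address subject presence and relation alignment in compositional video generation. | spatial relationship, latent optimization | |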
