cs.CV (2025-02-05)

📊 18 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗4) · Pillar 3: Perception & Semantics (3) · Pillar 8: Physics-based Animation (2) · Pillar 2: RL & Architecture (2) · Pillar 5: Interaction & Reaction (1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

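| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Can Large Language Models Capture Video Game Engagement? | Evaluates the ability of large language models to capture player engagement in video games | large language model, multimodal | |
| 2 | ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models | ZISVFM: zero-shot object instance segmentation for indoor robots using vision foundation models | foundation model | |
| 3 | DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation | DILLEMA: multi-modal data augmentation with diffusion models and large language models to improve the robustness of deep learning models | large language model | |
| 4 | Driver Assistance System Based on Multimodal Data Hazard Detection | Proposes a driver assistance system based on multimodal data fusion that improves detection accuracy for abnormal driving events | multimodal | |
| 5 | RadVLM: A Multitask Conversational Vision-Language Model for Radiology | RadVLM: a multitask conversational vision-language model for radiology | foundation model, visual grounding | |
| 6 | Expertized Caption Auto-Enhancement for Video-Text Retrieval | Proposes an expertized caption auto-enhancement method to address information mismatch in video-text retrieval | large language model, multimodal | |
| 7 | MaxInfo: A Training-Free Key-Frame Selection Method Using Maximum Volume for Enhanced Video Understanding | Proposes MaxInfo, a training-free key-frame selection method that improves video understanding | large language model | |
| 8 | Tell2Reg: Establishing spatial correspondence between images by the same language prompts | Tell2Reg: establishes spatial correspondence between images using the same language prompts, enabling training-free image registration | multimodal | |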

🔬 Pillar 3: Perception & Semantics (3 papers)

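| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 9 | MetaFE-DE: Learning Meta Feature Embedding for Depth Estimation from Monocular Endoscopic Images | Proposes MetaFE-DE, which uses meta feature embedding to tackle depth estimation from monocular endoscopic images | depth estimation, monocular depth | |
| 10 | MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent | MotionAgent: fine-grained controllable video generation via a motion field agent | optical flow, motion generation | |
| 11 | VistaFlow: Photorealistic Volumetric Reconstruction with Dynamic Resolution Management via Q-Learning | VistaFlow: photorealistic volume rendering with resolution managed dynamically via Q-learning | NeRF, neural radiance field | |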

🔬 Pillar 8: Physics-based Animation (2 papers)

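| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 12 | Deep Learning-based Event Data Coding: A Joint Spatiotemporal and Polarity Solution | Proposes DL-JEC, a deep learning-based joint spatiotemporal and polarity event data coding solution, for efficient compression | spatiotemporal | |
| 13 | Kronecker Mask and Interpretive Prompts are Language-Action Video Learners | Proposes CLAVER, which uses Kronecker Mask and interpretive prompts to improve CLIP's performance on video action recognition | spatiotemporal, large language model | |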

🔬 Pillar 2: RL & Architecture (2 papers)

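| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | Masked Autoencoders Are Effective Tokenizers for Diffusion Models | Proposes MAETok, which uses masked autoencoders to learn better token representations for diffusion models, significantly improving image generation quality and efficiency | masked autoencoder | |
| 15 | Domain-Invariant Per-Frame Feature Extraction for Cross-Domain Imitation Learning with Visual Observations | Proposes DIFF-IL for domain-invariant feature extraction from visual observations in cross-domain imitation learning | imitation learning | |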

🔬 Pillar 5: Interaction & Reaction (1 paper)

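| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | HSI: A Holistic Style Injector for Arbitrary Style Transfer | Proposes HSI, a holistic style injector, to address local distortion and high computational complexity in arbitrary style transfer | HSI | |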

🔬 Pillar 7: Motion Retargeting (1 paper)

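| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | Seeing World Dynamics in a Nutshell | NutWorld: efficiently converts monocular video into a dynamic 3D Gaussian representation, enabling spatiotemporally consistent modeling | geometric consistency, spatiotemporal | |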

🔬 Pillar 4: Generative Motion (1 paper)

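| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 18 | Dress-1-to-3: Single Image to Simulation-Ready 3D Outfit with Diffusion Prior and Differentiable Physics | Dress-1-to-3: proposes a method for reconstructing separable, simulation-ready 3D garments from a single image | motion generation | |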
