cs.CV (2025-04-11)

📊 28 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (7 🔗1) · Pillar 1: Robot Control (6 🔗2) · Pillar 9: Embodied Foundation Models (6 🔗2) · Pillar 3: Spatial Perception & Semantics (5) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction & Matching (1 🔗1)

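🔬 Pillar 2: RL Algorithms & Architecture (7 papers)

# | Title | One-line summary | Tags | 🔗
1 | Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing Modalities | Proposes KARMMA, a multimodal knowledge distillation method for egocentric action recognition that is robust to missing modalities | distillation, egocentric, multimodal
2 | DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding | Proposes the Diverse Semantic Map (DSM) to improve 3D visual grounding performance | world model, semantic map, affordance
3 | MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer | MotionDreamer: synthesizes many motion instances from a single reference motion using a localized generative masked Transformer | dreamer, motion synthesis
4 | Discriminator-Free Direct Preference Optimization for Video Diffusion | Proposes a discriminator-free direct preference optimization method for video diffusion, addressing data inefficiency and evaluation uncertainty in video generation | DPO, direct preference optimization
5 | Muon-Accelerated Attention Distillation for Real-Time Edge Synthesis via Optimized Latent Diffusion | Proposes the Muon-AD framework to accelerate real-time synthesis with latent diffusion models on edge devices | curriculum learning, distillation
6 | Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions | Proposes a self-supervised embodied image captioning method that improves an agent's ability to generate spatially coherent descriptions in complex environments | contrastive learning, large language model
7 | MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft | MineWorld: a real-time, open-source interactive world model on Minecraft, built on a visual-action autoregressive Transformer | world model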

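🔬 Pillar 1: Robot Control (6 papers)

# | Title | One-line summary | Tags | 🔗
8 | FMLGS: Fast Multilevel Language Embedded Gaussians for Part-level Interactive Agents | Proposes FMLGS, which speeds up building and querying part-level interactive agents in 3D Gaussian splatting | manipulation, 3D gaussian splatting, 3DGS
9 | MBE-ARI: A Multimodal Dataset Mapping Bi-directional Engagement in Animal-Robot Interaction | Proposes the MBE-ARI multimodal dataset to advance research on bidirectional communication in animal-robot interaction | quadruped, legged robot, multimodal
10 | Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization | Proposes Video-MSG to address layout control in text-to-video generation | manipulation, multimodal
11 | Latent Diffusion Autoencoders: Toward Efficient and Meaningful Unsupervised Representation Learning in Medical Imaging | Proposes Latent Diffusion Autoencoders (LDAE) for efficient and meaningful unsupervised representation learning on medical images | manipulation, representation learning, MAE
12 | A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation | Proposes a knowledge-guided adversarial defense (KGAD) to resist malicious visual manipulation | manipulation
13 | Multi-person Physics-based Pose Estimation for Combat Sports | Proposes a physics-based multi-person pose estimation framework to improve 3D pose estimation accuracy in combat sports | trajectory optimization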

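🔬 Pillar 9: Embodied Foundation Models (6 papers)

# | Title | One-line summary | Tags | 🔗
14 | Robust SAM: On the Adversarial Robustness of Vision Foundation Models | Proposes an adversarial robustness framework that improves SAM's defensive capability and its robustness-accuracy trade-off under different prompts | foundation model
15 | Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images | Proposes a multimodal-LLM-based system for analyzing temporal trends across massive image collections | multimodal
16 | Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model | Seaweed-7B: a cost-effective approach to training a video generation foundation model | foundation model
17 | LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs | Proposes LMM4LMM, an LMM-based automatic evaluation metric for image generation, together with the benchmark dataset EvalMi-50K | multimodal
18 | VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering | Proposes VLMT to address insufficient cross-modal reasoning in multimodal multi-hop question answering | multimodal
19 | F$^3$Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos | Proposes the F$^3$Set benchmark dataset for analyzing fast, frequent, and fine-grained events in videos, along with the F$^3$ED model | TAMP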

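🔬 Pillar 3: Spatial Perception & Semantics (5 papers)

# | Title | One-line summary | Tags | 🔗
20 | Cut-and-Splat: Leveraging Gaussian Splatting for Synthetic Data Generation | Uses Gaussian splatting to generate synthetic data, improving the training of instance segmentation models | depth estimation, monocular depth, gaussian splatting
21 | HAL-NeRF: High Accuracy Localization Leveraging Neural Radiance Fields | HAL-NeRF: high-accuracy camera localization using neural radiance fields | NeRF, neural radiance field
22 | Parameter-Free Fine-tuning via Redundancy Elimination for Vision Foundation Models | Proposes a parameter-free fine-tuning method based on redundancy elimination for adapting vision foundation models to downstream tasks | depth estimation, foundation model
23 | Hardware, Algorithms, and Applications of the Neuromorphic Vision Sensor: a Review | Review: hardware, algorithms, and applications of neuromorphic vision sensors | optical flow
24 | GeoTexBuild: 3D Building Model Generation from Map Footprints | GeoTexBuild: a new framework for generating 3D building models from map footprints | height map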

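🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line summary | Tags | 🔗
25 | Geometric Consistency Refinement for Single Image Novel View Synthesis via Test-Time Adaptation of Diffusion Models | Proposes a geometric consistency refinement method based on test-time adaptation of diffusion models, improving single-image novel view synthesis quality | geometric consistency
26 | RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements | Proposes RealCam-Vid, the first high-resolution video dataset with dynamic scenes and metric-scale camera movements | geometric consistency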

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
27 | Ego4o: Egocentric Human Motion Capture and Understanding from Multi-Modal Input | Proposes the Ego4o framework to address human motion capture from multimodal input | VQ-VAE, egocentric

🔬 Pillar 6: Video Extraction & Matching (1 paper)

# | Title | One-line summary | Tags | 🔗
28 | The Invisible EgoHand: 3D Hand Forecasting through EgoBody Pose Estimation | Proposes EgoH4 to overcome visibility limitations in hand pose forecasting | egocentric
