cs.CV (2025-05-07)
📊 23 papers total | 🔗 6 with code
🎯 Interest Area Navigation
Pillar 2: RL & Architecture (5 🔗1)
Pillar 9: Embodied Foundation Models (5 🔗1)
Pillar 3: Perception & Semantics (5 🔗1)
Pillar 1: Robot Control (3 🔗1)
Pillar 6: Video Extraction (2 🔗1)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 7: Motion Retargeting (1)
🔬 Pillar 2: RL & Architecture (5 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | EchoInk-R1: uses reinforcement learning to strengthen audio-visual reasoning in multimodal LLMs | reinforcement learning, large language model, multimodal | | |
| 2 | Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting | Proposes the Coupled Score Distillation (CSD) framework to address geometric consistency in text-to-3D generation and to optimize 3D Gaussian Splatting | distillation, 3D gaussian splatting, gaussian splatting | | |
| 3 | Occupancy World Model for Robots | Proposes RoboOccWorld to predict 3D occupancy scene evolution in indoor robot scenes | world model | | |
| 4 | WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing | WDMamba: an image dehazing method that combines a wavelet degradation prior with Vision Mamba | Mamba | ✅ | |
| 5 | SMMT: Siamese Motion Mamba with Self-attention for Thermal Infrared Target Tracking | Proposes SMMT, a Siamese architecture that fuses Motion Mamba with self-attention to improve thermal infrared target tracking | Mamba | | |
🔬 Pillar 9: Embodied Foundation Models (5 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 6 | On Path to Multimodal Generalist: General-Level and General-Bench | Proposes the General-Level evaluation framework to advance multimodal generalist models | large language model, foundation model, multimodal | | |
| 7 | OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning | OpenVision: fully open, cost-effective vision encoders for multimodal learning | foundation model, multimodal | | |
| 8 | HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation | HunyuanCustom: a multimodal-driven architecture for customized video generation | multimodal | | |
| 9 | CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation | CAD-Llama: uses large language models to generate parametric 3D CAD models | large language model | | |
| 10 | ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos | ViDRiP-LLaVA: a multimodal dataset and benchmark for diagnostic reasoning from pathology videos | multimodal, chain-of-thought | ✅ | |
🔬 Pillar 3: Perception & Semantics (5 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
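| 11 | DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception | DeCLIP: decoupled learning for open-vocabulary dense perception, improving local discriminability and spatial consistency | open-vocabulary, open vocabulary, foundation model | ✅ | |
| 12 | Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers | Proposes LayouSyn, a diffusion-Transformer-based method for natural scene layout generation that improves controllable image generation | open-vocabulary, open vocabulary, large language model | | |
| 13 | One2Any: One-Reference 6D Pose Estimation for Any Object | Proposes One2Any, which estimates the 6D pose of any object from a single reference image | 6D pose estimation | | |
| 14 | Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective | Proposes SATNet, a speed-accuracy tradeoff network for lightweight RGB-D salient object detection | Depth Anything | | |
| 15 | RAFT -- A Domain Adaptation Framework for RGB & LiDAR Semantic Segmentation | Proposes the RAFT framework, which improves domain adaptation for RGB-LiDAR semantic segmentation through data augmentation and active learning | scene understanding | | |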
🔬 Pillar 1: Robot Control (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | Vision-Language-Action Models: Concepts, Progress, Applications and Challenges | Survey paper: progress, applications, and challenges of Vision-Language-Action models | humanoid, humanoid robot, cross-embodiment | | |
| 17 | DATA: Multi-Disentanglement based Contrastive Learning for Open-World Semi-Supervised Deepfake Attribution | Proposes DATA, a multi-disentanglement contrastive learning framework for open-world semi-supervised deepfake attribution | manipulation, contrastive learning | | |
| 18 | Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions | Web2Grasp: learns functional grasps from web images to improve robot manipulation | sim-to-real, HOI | ✅ | |
🔬 Pillar 6: Video Extraction (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | Object-Shot Enhanced Grounding Network for Egocentric Video | Proposes OSGNet, which uses object and viewpoint information to improve grounding in egocentric video | egocentric | ✅ | |
| 20 | FoodTrack: Estimating Handheld Food Portions with Egocentric Video | FoodTrack: estimates handheld food portions from egocentric video | egocentric | | |
🔬 Pillar 8: Physics-based Animation (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 21 | SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios | Proposes the SToLa framework to tackle tactile commonsense reasoning in open-ended scenarios | interactive character, multimodal | | |
| 22 | HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation | HDiffTG: a lightweight hybrid diffusion-Transformer-GCN method for 3D human pose estimation | spatiotemporal | ✅ | |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 23 | LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding | Proposes the LSVG framework, which uses language-guided scene graphs and 2D-assisted multi-modal encoding for 3D visual grounding | spatial relationship, visual grounding | | |