cs.CV (2025-05-07)

📊 23 papers total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (5 🔗1) · Pillar 9: Embodied Foundation Models (5 🔗1) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 6: Video Extraction (2 🔗1) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 2: RL & Architecture (5 papers)

| # | Title | One-Sentence Summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning | Uses reinforcement learning to strengthen audio-visual reasoning in multimodal LLMs. | reinforcement learning, large language model, multimodal | |
| 2 | Bridging Geometry-Coherent Text-to-3D Generation with Multi-View Diffusion Priors and Gaussian Splatting | Proposes a Coupled Score Distillation (CSD) framework to address geometric consistency in text-to-3D generation and to optimize 3D Gaussian splatting. | distillation, 3D gaussian splatting | |
| 3 | Occupancy World Model for Robots | Proposes RoboOccWorld, which predicts the evolution of 3D occupancy in indoor robot scenes. | world model | |
| 4 | WDMamba: When Wavelet Degradation Prior Meets Vision Mamba for Image Dehazing | An image dehazing method combining a wavelet degradation prior with Vision Mamba. | Mamba | |
| 5 | SMMT: Siamese Motion Mamba with Self-attention for Thermal Infrared Target Tracking | Proposes SMMT, a Siamese architecture fusing Motion Mamba with self-attention to improve thermal infrared target tracking. | Mamba | |

🔬 Pillar 9: Embodied Foundation Models (5 papers)

| # | Title | One-Sentence Summary | Tags | 🔗 |
|---|---|---|---|---|
| 6 | On Path to Multimodal Generalist: General-Level and General-Bench | Proposes the General-Level evaluation framework to advance multimodal generalist models. | large language model, foundation model, multimodal | |
| 7 | OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning | A fully open, cost-effective family of vision encoders for multimodal learning. | foundation model, multimodal | |
| 8 | HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation | A multimodal-driven architecture for customized video generation. | multimodal | |
| 9 | CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation | Leverages large language models to generate parametric CAD 3D models. | large language model | |
| 10 | ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos | A multimodal dataset and benchmark for diagnostic reasoning from pathology videos. | multimodal, chain-of-thought | |

🔬 Pillar 3: Perception & Semantics (5 papers)

| # | Title | One-Sentence Summary | Tags | 🔗 |
|---|---|---|---|---|
| 11 | DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception | Decoupled learning for open-vocabulary dense perception, improving local discriminability and spatial consistency. | open-vocabulary, foundation model | |
| 12 | Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers | Proposes LayouSyn, a diffusion-Transformer-based natural scene layout generation method that improves controllable image generation. | open-vocabulary, large language model | |
| 13 | One2Any: One-Reference 6D Pose Estimation for Any Object | Proposes One2Any, achieving 6D pose estimation of arbitrary objects from a single reference image. | 6D pose estimation | |
| 14 | Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective | Proposes SATNet, a speed-accuracy tradeoff network for lightweight RGB-D salient object detection. | Depth Anything | |
| 15 | RAFT -- A Domain Adaptation Framework for RGB & LiDAR Semantic Segmentation | Proposes the RAFT framework, which uses data augmentation and active learning to improve domain adaptation for RGB-LiDAR semantic segmentation. | scene understanding | |

🔬 Pillar 1: Robot Control (3 papers)

| # | Title | One-Sentence Summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | Vision-Language-Action Models: Concepts, Progress, Applications and Challenges | Survey: concepts, progress, applications, and challenges of Vision-Language-Action models. | humanoid robot, cross-embodiment | |
| 17 | DATA: Multi-Disentanglement based Contrastive Learning for Open-World Semi-Supervised Deepfake Attribution | Proposes DATA, a multi-disentanglement contrastive learning framework for open-world semi-supervised deepfake attribution. | manipulation, contrastive learning | |
| 18 | Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions | Learns functional grasps from web images of hand-object interactions, improving robotic manipulation. | sim-to-real, HOI | |

🔬 Pillar 6: Video Extraction (2 papers)

| # | Title | One-Sentence Summary | Tags | 🔗 |
|---|---|---|---|---|
| 19 | Object-Shot Enhanced Grounding Network for Egocentric Video | Proposes OSGNet, which enhances grounding in egocentric video using object and shot information. | egocentric | |
| 20 | FoodTrack: Estimating Handheld Food Portions with Egocentric Video | Estimates handheld food portions from egocentric video. | egocentric | |

🔬 Pillar 8: Physics-based Animation (2 papers)

| # | Title | One-Sentence Summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | SToLa: Self-Adaptive Touch-Language Framework with Tactile Commonsense Reasoning in Open-Ended Scenarios | Proposes the SToLa framework to tackle tactile commonsense reasoning in open-ended scenarios. | interactive character, multimodal | |
| 22 | HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation | A lightweight hybrid diffusion-Transformer-GCN method for 3D human pose estimation. | spatiotemporal | |

🔬 Pillar 7: Motion Retargeting (1 paper)

| # | Title | One-Sentence Summary | Tags | 🔗 |
|---|---|---|---|---|
| 23 | LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding | Proposes the LSVG framework, using language-guided scene graphs and 2D-assisted multimodal encoding for 3D visual grounding. | spatial relationship, visual grounding | |
