cs.CV (2025-05-05)

📊 24 papers in total | 🔗 5 with code

🎯 Navigate by Interest Area

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 2: RL Algorithms & Architecture (4) · Pillar 3: Spatial Perception & Semantics (4 🔗1) · Pillar 1: Robot Control (2 🔗1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

1. Multimodal Deep Learning for Stroke Prediction and Detection using Retinal Imaging and Clinical Data
   Proposes a multimodal deep-learning method based on retinal imaging and clinical data for stroke prediction and detection. (foundation model, multimodal)
2. AOR: Anatomical Ontology-Guided Reasoning for Medical Large Multimodal Model in Chest X-Ray Interpretation
   Proposes the AOR framework, which uses anatomical knowledge to strengthen the reasoning of medical large multimodal models in chest X-ray interpretation. (multimodal)
3. GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation
   Proposes GAME, which learns multimodal interactions via graph structures for personality trait estimation. (multimodal)
4. DeepSparse: A Foundation Model for Sparse-View CBCT Reconstruction
   DeepSparse: a foundation model for sparse-view CBCT reconstruction that improves reconstruction quality while reducing radiation dose. (foundation model)
5. Detect, Classify, Act: Categorizing Industrial Anomalies with Multi-Modal Large Language Models
   Proposes VELM, which categorizes industrial anomalies with multimodal large language models, making anomaly detection more practically useful. (large language model)
6. Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
   Surveys unified multimodal understanding and generation models, analyzing architectural paradigms, challenges, and opportunities to guide future research. (multimodal)
7. Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction
   Ming-Lite-Uni: unifies a visual generator with a multimodal autoregressive model to enable natural multimodal interaction. (multimodal)
8. Timing Is Everything: Finding the Optimal Fusion Points in Multimodal Medical Imaging
   Proposes a sequential-forward-search method for optimizing fusion points in multimodal medical imaging, improving diagnostic accuracy. (multimodal)
9. Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
   Proposes a video anomaly detection method based on uncertainty-weighted image-event multimodal fusion. (multimodal)
10. Using Knowledge Graphs to harvest datasets for efficient CLIP model training
   Uses knowledge graphs to enhance dataset harvesting for efficient CLIP model training. (foundation model)
11. RGBX-DiffusionDet: A Framework for Multi-Modal RGB-X Object Detection Using DiffusionDet
   Proposes RGBX-DiffusionDet, which uses a diffusion model to fuse RGB images with heterogeneous 2D data for object detection. (multimodal)
12. Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey
   A survey of OOD detection with CLIP-like models, proposing a new taxonomy from an image-text dual-modality perspective. (multimodal)
13. TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment
   Proposes TeDA, which improves the zero-shot 3D object retrieval performance of vision-language models via test-time distribution alignment. (multimodal)

🔬 Pillar 2: RL Algorithms & Architecture (4 papers)

14. R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
   Proposes the StableReinforce algorithm, improving the long-horizon reasoning ability and training stability of multimodal reward models. (reinforcement learning, reward design, large language model)
15. VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection
   VAEmo: efficiently learns visual-audio emotion representations via knowledge injection, improving AVER performance. (representation learning, contrastive learning, large language model)
16. Text to Image Generation and Editing: A Survey
   A comprehensive survey of text-to-image generation and editing techniques, with insights into future directions. (Mamba, classifier-free guidance, foundation model)
17. Learning 3D Persistent Embodied World Models
   Proposes an embodied world model with persistent memory for consistent long-horizon planning. (policy learning, world model)

🔬 Pillar 3: Spatial Perception & Semantics (4 papers)

18. Advancing Generalizable Tumor Segmentation with Anomaly-Aware Open-Vocabulary Attention Maps and Frozen Foundation Diffusion Models
   DiffuGTS: achieves generalizable tumor segmentation using anomaly-aware open-vocabulary attention maps and frozen foundation diffusion models. (open-vocabulary)
19. VGLD: Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery
   Proposes the VGLD framework, which recovers monocular depth scale through visually guided linguistic disambiguation. (depth estimation, monocular depth, metric depth)
20. 6D Pose Estimation on Spoons and Hands
   Proposes a 6D pose estimation system based on video object segmentation for tracking hand and spoon motion during meals. (6D pose estimation)
21. DELTA: Dense Depth from Events and LiDAR using Transformer's Attention
   DELTA: fuses event-camera and LiDAR data with Transformer attention to achieve accurate dense depth estimation. (depth estimation)

🔬 Pillar 1: Robot Control (2 papers)

22. MetaScenes: Towards Automated Replica Creation for Real-world 3D Scans
   MetaScenes: proposes an automated method for creating replicas of real-world 3D scans for embodied AI research. (manipulation, sim-to-real, embodied AI)
23. Sim2Real in endoscopy segmentation with a novel structure aware image translation
   Proposes a structure-aware image translation method for the Sim2Real problem in endoscopy image segmentation. (sim2real)

🔬 Pillar 4: Generative Motion (1 paper)

24. Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation
   Scenethesis: a language- and vision-agent framework for 3D scene generation. (physically plausible, penetration, embodied AI)
