cs.CV (2025-03-04)
📊 22 papers in total | 🔗 8 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6 🔗3)
Pillar 2: RL & Architecture (4 🔗1)
Pillar 3: Perception & Semantics (3 🔗1)
Pillar 4: Generative Motion (3)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 5: Interaction & Reaction (1 🔗1)
Pillar 7: Motion Retargeting (1 🔗1)
Pillar 1: Robot Control (1)
Pillar 6: Video Extraction (1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | A Token-level Text Image Foundation Model for Document Understanding | Proposes TokenOCR, a token-level text-image foundation model for document understanding | large language model, foundation model | ✅ | |
| 2 | Multimodal Deep Learning for Subtype Classification in Breast Cancer Using Histopathological Images and Gene Expression Data | Proposes a multimodal deep learning framework for breast cancer subtype classification | multimodal | | |
| 3 | SPIDER: A Comprehensive Multi-Organ Supervised Pathology Dataset and Baseline Models | SPIDER: builds a multi-organ pathology image dataset with baseline models to advance AI pathology research | foundation model, multimodal | ✅ | |
| 4 | BioD2C: A Dual-level Semantic Consistency Constraint Framework for Biomedical VQA | BioD2C: a dual-level semantic consistency constraint framework that improves biomedical VQA performance | large language model, multimodal | | |
| 5 | CADDI: An in-Class Activity Detection Dataset using IMU data from low-cost sensors | CADDI: a classroom activity detection dataset built on low-cost IMU sensors, supporting activity recognition in educational settings | multimodal | | |
| 6 | StageDesigner: Artistic Stage Generation for Scenography via Theater Scripts | StageDesigner: a comprehensive framework that generates artistic stage scenes from theater scripts | large language model | ✅ | |
🔬 Pillar 2: RL & Architecture (4 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging | Proposes the Cross-Fraternal Twin Masked Autoencoder for cross-modal anatomical and functional PET/CT imaging | representation learning, masked autoencoder, foundation model | | |
| 8 | LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning | LLaVE: large language-and-vision embedding models trained with hardness-weighted contrastive learning, reporting SOTA performance | representation learning, contrastive learning, multimodal | | |
| 9 | SSNet: Saliency Prior and State Space Model-based Network for Salient Object Detection in RGB-D Images | Proposes SSNet, built on a saliency prior and a state space model, for salient object detection in RGB-D images | SSM, state space model, scene understanding | | |
| 10 | WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation | WMNav: an object goal navigation framework that integrates vision-language models with world models | world model, embodied AI | ✅ | |
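🔬 Pillar 3: Perception & Semantics (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | 2DGS-Avatar: Animatable High-fidelity Clothed Avatar via 2D Gaussian Splatting | Proposes 2DGS-Avatar, which uses 2D Gaussian splatting for real-time rendering of high-fidelity, animatable clothed avatars | 3D gaussian splatting, 3DGS, gaussian splatting | | |
| 12 | Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts | Proposes the BiT-Align framework, which uses complementary depth and semantic prompts to improve affordance reasoning under resource constraints | affordance, multimodal | ✅ | |
| 13 | Label-Efficient LiDAR Panoptic Segmentation | Proposes L3PS, which achieves efficient LiDAR panoptic segmentation from a small amount of labeled data | scene understanding | | |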
🔬 Pillar 4: Generative Motion (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | SPG: Improving Motion Diffusion by Smooth Perturbation Guidance | SPG: improves the generation quality of motion diffusion models through smooth perturbation guidance | motion diffusion model, motion diffusion | | |
| 15 | ARC-Flow: Articulated, Resolution-Agnostic, Correspondence-Free Matching and Interpolation of 3D Shapes Under Flow Fields | Proposes ARC-Flow, which matches and interpolates articulated 3D shapes via flow fields without requiring correspondences | physically plausible | | |
| 16 | Efficient Training-Free High-Resolution Synthesis with Energy Rectification in Diffusion Models | Proposes RectifiedHR, an efficient training-free method for high-resolution image synthesis with diffusion models | classifier-free guidance | | |
🔬 Pillar 8: Physics-based Animation (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments | Proposes the MM-OR multimodal operating room dataset and the MM2SG model to improve semantic understanding of high-intensity surgical environments | spatiotemporal, multimodal | ✅ | |
| 18 | TReND: Transformer derived features and Regularized NMF for neonatal functional network Delineation | Proposes the TReND framework, which uses Transformer-derived features and regularized NMF to delineate neonatal functional networks | spatiotemporal | | |
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs | Proposes MapleLeaf AKI, which decouples causal attention to enable modality-mutual attention in multimodal LLMs | mutual attention, large language model, foundation model | ✅ | |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 20 | CMMLoc: Advancing Text-to-PointCloud Localization with Cauchy-Mixture-Model Based Framework | CMMLoc: a Cauchy-mixture-model-based framework for text-to-point-cloud localization | spatial relationship | ✅ | |
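🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 21 | Monocular Person Localization under Camera Ego-motion | Proposes a person localization method that uses a four-point human model under monocular camera motion, improving localization accuracy in human-machine interaction | quadruped | | |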
🔬 Pillar 6: Video Extraction (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 22 | mmDEAR: mmWave Point Cloud Density Enhancement for Accurate Human Body Reconstruction | Proposes the mmDEAR framework, which enhances mmWave point cloud density to improve human body reconstruction accuracy | SMPL | | |