cs.CV (2024-06-07)
📊 26 papers in total | 🔗 10 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (7 🔗5)
Pillar 3: Perception & Semantics (6 🔗3)
Pillar 2: RL & Architecture (5)
Pillar 1: Robot Control (3)
Pillar 7: Motion Retargeting (2 🔗1)
Pillar 5: Interaction & Reaction (1 🔗1)
Pillar 6: Video Extraction (1)
Pillar 8: Physics-based Animation (1)
🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Towards Semantic Equivalence of Tokenization in Multimodal LLM | Proposes SeTok, a dynamic semantic-equivalent vision tokenization method that improves multimodal LLM performance | large language model multimodal | ✅ | |
| 2 | MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description | Proposes MGIMM, which uses multi-granularity instruction learning to generate attribute-guided detailed descriptions of remote sensing images | large language model multimodal | ✅ | |
| 3 | VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging | VISTA3D: a unified segmentation foundation model for 3D medical imaging | foundation model | ✅ | |
| 4 | RU-AI: A Large Multimodal Dataset for Machine-Generated Content Detection | Proposes RU-AI, a large-scale multimodal dataset for detecting machine-generated content | multimodal | ✅ | |
| 5 | Interpretable Multimodal Out-of-context Detection with Soft Logic Regularization | Proposes LOGRAN, which uses soft logic regularization for interpretable multimodal out-of-context detection | multimodal | | |
| 6 | LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model | Proposes LocLLM, which uses a large language model for more generalizable human keypoint localization from text descriptions | large language model | | |
| 7 | Predictive Dynamic Fusion | Proposes a predictive dynamic fusion framework to address instability in multimodal fusion | multimodal | ✅ | |
🔬 Pillar 3: Perception & Semantics (6 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 8 | USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation | Proposes the Universal Segment Embeddings (USE) framework to address accurate classification in open-vocabulary image segmentation | open-vocabulary open vocabulary foundation model | | |
| 9 | OVMR: Open-Vocabulary Recognition with Multi-Modal References | Proposes OVMR, which uses multimodal references for open-vocabulary recognition | open-vocabulary open vocabulary | ✅ | |
| 10 | Composition Vision-Language Understanding via Segment and Depth Anything Model | Proposes fusing depth and segmentation models to enhance vision-language understanding | Depth Anything multimodal | ✅ | |
| 11 | Multi-style Neural Radiance Field with AdaIN | Proposes a multi-style neural radiance field that combines AdaIN and NeRF for stylized novel view synthesis | NeRF neural radiance field | | |
| 12 | Normal-guided Detail-Preserving Neural Implicit Function for High-Fidelity 3D Surface Reconstruction | Proposes a normal-guided neural implicit function for high-fidelity 3D surface reconstruction, especially suited to sparse-view scenes | monocular depth implicit representation | | |
| 13 | Ada-VE: Training-Free Consistent Video Editing Using Adaptive Motion Prior | Proposes an adaptive motion prior to address consistency in video editing | optical flow | ✅ | |
🔬 Pillar 2: RL & Architecture (5 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting | STAR: proposes a skeleton-aware, text-driven 4D avatar generation method with in-network motion retargeting | distillation motion retargeting | | |
| 15 | Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs | Proposes Diffusion Mamba (DiM-3D), which efficiently generates high-resolution 3D shapes and addresses the computational bottleneck of conventional diffusion models | Mamba SSM | | |
| 16 | Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement | Proposes a self-supervised heart rate measurement method that combines spatial-temporal modeling with contrastive learning; placed second in the RePSS Challenge | contrastive learning | | |
| 17 | MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers | MA-AVT: proposes a parameter-efficient audio-visual Transformer that improves performance through modality alignment | contrastive learning multimodal | | |
| 18 | Attention Fusion Reverse Distillation for Multi-Lighting Image Anomaly Detection | Proposes Attention Fusion Reverse Distillation (AFRD) to address multi-lighting image anomaly detection | distillation | | |
🔬 Pillar 1: Robot Control (3 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | Training-Free Video Editing via Optical Flow-Enhanced Score Distillation | Proposes a training-free video editing method based on optical flow-enhanced score distillation | manipulation distillation optical flow | | |
| 20 | 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination | Proposes the 3D-GRAND dataset, which improves the scene understanding of 3D-LLMs and reduces hallucination | sim-to-real embodied AI large language model | | |
| 21 | Varying Manifolds in Diffusion: From Time-varying Geometries to Visual Saliency | Proposes a generation-rate-based geometric analysis of diffusion models, enabling image saliency manipulation and a range of image editing tasks | manipulation | | |
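🔬 Pillar 7: Motion Retargeting (2 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 22 | Diving Deep into the Motion Representation of Video-Text Models | Uses GPT-4 to generate fine-grained motion descriptions, improving how well video-text models understand motion in video | motion representation | | |
| 23 | SMC++: Masked Learning of Unsupervised Video Semantic Compression | Proposes SMC++, a masked-learning framework for unsupervised video semantic compression that improves performance on video analysis tasks | motion prediction | ✅ | |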
🔬 Pillar 5: Interaction & Reaction (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 24 | SMART: Scene-motion-aware human action recognition framework for mental disorder group | Proposes SMART, a scene-motion-aware action recognition framework for people with mental disorders, intended for intelligent healthcare video monitoring | human-scene interaction human motion | ✅ | |
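🔬 Pillar 6: Video Extraction (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 25 | ProMotion: Prototypes As Motion Learners | ProMotion: proposes a unified motion modeling framework based on prototype learning that improves performance across multiple motion tasks | feature matching motion representation | | |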
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 26 | Semantic Segmentation on VSPW Dataset through Masked Video Consistency | Proposes a semantic segmentation method based on masked video consistency that improves performance on the VSPW dataset | spatiotemporal multimodal | | |