cs.CV (2024-07-30)
📊 17 papers in total | 🔗 2 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (5)
Pillar 2: RL & Architecture (3)
Pillar 3: Perception & Semantics (3 🔗2)
Pillar 1: Robot Control (2)
Pillar 5: Interaction & Reaction (2)
Pillar 4: Generative Motion (1)
Pillar 6: Video Extraction (1)
🔬 Pillar 9: Embodied Foundation Models (5 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | AI Safety in Practice: Enhancing Adversarial Robustness in Multimodal Image Captioning | Proposes an adversarial-training-based method for improving the robustness of multimodal image captioning | multimodal | | |
| 2 | MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions | Proposes MMTrail, a large-scale multimodal trailer video dataset with language and music descriptions | multimodal | | |
| 3 | Benchmarking Histopathology Foundation Models for Ovarian Cancer Bevacizumab Treatment Response Prediction from Whole Slide Images | Uses histopathology foundation models to predict ovarian cancer bevacizumab treatment response from whole slide images (WSIs) | foundation model | | |
| 4 | SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models | SynthVLM: high-quality, efficient synthesis of image-text datasets for vision-language models | large language model, multimodal | | |
| 5 | Effectively Leveraging CLIP for Generating Situational Summaries of Images and Videos | Proposes ClipSitu, which leverages CLIP to generate situational summaries of images and videos, achieving excellent situation recognition and localization | multimodal | | |
🔬 Pillar 2: RL & Architecture (3 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 6 | CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning | CLEFT: language-image contrastive learning with an efficient large language model and prompt fine-tuning, improving performance on medical imaging tasks | representation learning, contrastive learning, large language model | | |
| 7 | Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks | Proposes PDCL-Attack, which uses the CLIP model to improve the transferability of generative-model-based adversarial attacks | contrastive learning, foundation model | | |
| 8 | SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting | Proposes SpotFormer, a multi-scale spatio-temporal Transformer for facial expression spotting | contrastive learning, optical flow | | |
🔬 Pillar 3: Perception & Semantics (3 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 9 | Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering | DynaVol-S: dynamic scene understanding through object-centric voxelization and neural rendering | NeRF, scene understanding | | |
| 10 | NIS-SLAM: Neural Implicit Semantic RGB-D SLAM for 3D Consistent Scene Understanding | NIS-SLAM: neural implicit semantic RGB-D SLAM for 3D-consistent scene understanding | implicit representation, scene understanding | ✅ | |
| 11 | SceneTeller: Language-to-3D Scene Generation | SceneTeller: proposes a pioneering method for generating high-quality 3D indoor scenes from text descriptions | 3D gaussian splatting, gaussian splatting, splatting | ✅ | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | FACL-Attack: Frequency-Aware Contrastive Learning for Transferable Adversarial Attacks | Proposes FACL-Attack, which uses frequency-domain contrastive learning to improve the cross-domain and cross-model transferability of adversarial examples | domain randomization, contrastive learning | | |
| 13 | WARM-3D: A Weakly-Supervised Sim2Real Domain Adaptation Framework for Roadside Monocular 3D Object Detection | Proposes the WARM-3D framework to address Sim2Real domain adaptation in roadside monocular 3D object detection | sim2real | | |
🔬 Pillar 5: Interaction & Reaction (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Monocular Human-Object Reconstruction in the Wild | Proposes a 2D-supervised method for monocular 3D reconstruction of human-object interaction in the wild | human-object interaction | | |
| 15 | StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset | StackFLOW: monocular 3D human-object reconstruction using stacked normalizing flows with offsets | human-object interaction | | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls | MotionCraft: proposes a whole-body motion generation framework with plug-and-play multimodal controls | text-to-motion, motion generation, SMPL | | |
🔬 Pillar 6: Video Extraction (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | EgoSonics: Generating Synchronized Audio for Silent Egocentric Videos | EgoSonics: proposes a method for generating synchronized audio for silent egocentric videos | egocentric | | |