cs.CV (2024-06-13)
📊 51 papers in total | 🔗 20 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (15 🔗10)
Pillar 3: Perception & Semantics (15 🔗3)
Pillar 2: RL & Architecture (12 🔗4)
Pillar 6: Video Extraction (5 🔗2)
Pillar 7: Motion Retargeting (2 🔗1)
Pillar 1: Robot Control (2)
🔬 Pillar 9: Embodied Foundation Models (15 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
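| 1 | Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models | Proposes Visual Sketchpad, which gives multimodal language models a visual sketchpad to improve complex reasoning. | multimodal, chain-of-thought | ✅ | |
| 2 | Towards Vision-Language Geo-Foundation Model: A Survey | Survey of research progress and future directions for vision-language geo-foundation models (VLGFMs). | foundation model, multimodal, visual grounding | ✅ | |
| 3 | Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset | Proposes the MARS dataset for multiagent, multitraversal, multimodal autonomous driving research. | multimodal | ✅ | |
| 4 | MMRel: Benchmarking Relation Understanding in Multi-Modal Large Language Models | Proposes the MMRel benchmark for evaluating relation understanding in multimodal large language models. | large language model | ✅ | |
| 5 | Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding | Proposes DASANet, which aligns attribute and spatial-relation features in two branches for more accurate 3D visual grounding. | visual grounding | | |
| 6 | MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs | Proposes MMFakeBench, a mixed-source multimodal misinformation detection benchmark for LVLMs. | multimodal | | |
| 7 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | VideoGPT+: combines image and video encoders to improve video understanding. | large language model, multimodal | ✅ | |
| 8 | Explore the Limits of Omni-modal Pretraining at Scale | Proposes MiCo, a scalable omni-modal pretraining framework that significantly improves multimodal understanding. | large language model, multimodal | ✅ | |
| 9 | Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | VideoNIAH: a scalable synthetic evaluator for video MLLMs that addresses the difficulty of evaluating video understanding models. | large language model, multimodal | ✅ | |
| 10 | Comparison Visual Instruction Tuning | Proposes the CaD-VI framework and the CaD-Inst dataset to improve LMM performance on image comparison tasks. | multimodal, instruction following | | |
| 11 | INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | INS-MMBench: the first comprehensive benchmark for large multimodal models in the insurance domain, covering 22 fundamental tasks. | large language model, multimodal | ✅ | |
| 12 | Parameter-Efficient Active Learning for Foundational models | Proposes a parameter-efficient active learning framework that improves foundation model performance on few-shot image classification. | foundation model | | |
| 13 | Language-driven Grasp Detection | Proposes a diffusion-model-based method for language-driven grasp detection and builds the large-scale Grasp-Anything++ dataset. | foundation model | ✅ | |
| 14 | ReMI: A Dataset for Reasoning with Multiple Images | ReMI: a dataset for evaluating large language models on reasoning over multiple images. | large language model | ✅ | |
| 15 | Zoom and Shift are All You Need | Proposes a zoom-and-shift method for aligning multimodal features, enabling deep fusion of information across modalities. | multimodal | | |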
🔬 Pillar 3: Perception & Semantics (15 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | GaussianForest: Hierarchical-Hybrid 3D Gaussian Splatting for Compressed Scene Modeling | Proposes GaussianForest, which compresses scene models with a hierarchical hybrid Gaussian representation, significantly reducing storage requirements. | 3D gaussian splatting, gaussian splatting, splatting | ✅ | |
| 17 | Depth Anything V2 | Depth Anything V2: efficient and robust monocular depth estimation via large-scale synthetic data and knowledge distillation. | depth estimation, monocular depth, metric depth | | |
| 18 | Scale-Invariant Monocular Depth Estimation via SSI Depth | Uses SSI depth to achieve scale-invariant monocular depth estimation with improved generalization. | depth estimation, monocular depth | | |
| 19 | ImageNet3D: Towards General-Purpose Object-Level 3D Understanding | Proposes ImageNet3D, a large-scale dataset for general-purpose object-level 3D understanding. | open-vocabulary, open vocabulary, large language model | | |
| 20 | Modeling Ambient Scene Dynamics for Free-view Synthesis | Proposes a free-view synthesis method for dynamic scenes based on modeling periodic motion. | 3D gaussian splatting, 3DGS, gaussian splatting | | |
| 21 | Neural NeRF Compression | Proposes a neural-compression-based method for compressing NeRF models, effectively reducing storage overhead. | NeRF, neural radiance field | | |
| 22 | GGHead: Fast and Generalizable 3D Gaussian Heads | Proposes GGHead, which uses 3D Gaussian heads for fast and generalizable 3D head generation. | 3D gaussian splatting, gaussian splatting, splatting | ✅ | |
| 23 | NeRF Director: Revisiting View Selection in Neural Volume Rendering | NeRF Director: revisits the view selection problem in neural volume rendering. | NeRF | | |
| 24 | MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | MuirBench: a comprehensive benchmark for robust multi-image understanding. | scene understanding, multimodal | | |
| 25 | Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024 | Proposes an improved vast-vocabulary object detection solution for the V3Det Challenge that better handles complex categories and bounding boxes. | open-vocabulary, open vocabulary | | |
| 26 | 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation | Proposes 3D-AVS, which performs auto-vocabulary segmentation of LiDAR point clouds without human intervention. | open-vocabulary, open vocabulary | ✅ | |
| 27 | ToSA: Token Selective Attention for Efficient Vision Transformers | Proposes Token Selective Attention (ToSA) for efficient Vision Transformers. | depth estimation, monocular depth | | |
| 28 | Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion | Instruct 4D-to-4D: high-quality, spatiotemporally consistent 4D scene editing using 2D diffusion models. | optical flow | | |
| 29 | WonderWorld: Interactive 3D Scene Generation from a Single Image | WonderWorld: a framework for interactive 3D scene generation from a single image. | depth estimation | | |
| 30 | OpenMaterial: A Large-scale Dataset of Complex Materials for 3D Reconstruction | OpenMaterial: a large-scale dataset of complex materials for 3D reconstruction, aimed at improving realistic reconstruction quality. | neural radiance field | | |
🔬 Pillar 2: RL & Architecture (12 papers)
🔬 Pillar 6: Video Extraction (5 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 43 | SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video | Proposes SViTT-Ego, a sparse video-text Transformer for improving egocentric video understanding. | egocentric, egocentric vision, foundation model | | |
| 44 | Introducing HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking | HOT3D: an egocentric vision dataset for 3D hand and object tracking. | MANO, egocentric | ✅ | |
| 45 | CARLOR @ Ego4D Step Grounding Challenge: Bayesian temporal-order priors for test time refinement | Proposes Bayesian-VSLNet, which uses Bayesian temporal-order priors for step grounding in Ego4D videos. | egocentric, Ego4D | | |
| 46 | Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos | Proposes the AV-LDM model for generating ambient-aware action sounds from egocentric videos. | egocentric, Ego4D | | |
| 47 | EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding | EgoExo-Fitness: a new dataset for full-body action understanding from egocentric and exocentric views. | egocentric | ✅ | |
🔬 Pillar 7: Motion Retargeting (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 48 | MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations | MMScan: builds a multimodal 3D scene dataset with hierarchical language annotations to advance 3D perception research. | spatial relationship, visual grounding | ✅ | |
| 49 | SPAN: Unlocking Pyramid Representations for Gigapixel Histopathological Images | SPAN: unlocks pyramid representations for gigapixel histopathological image analysis. | spatial relationship | | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
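| 50 | SimGen: Simulator-conditioned Driving Scene Generation | SimGen: proposes a simulator-conditioned driving scene generation framework that improves the quality and diversity of synthetic data. | sim-to-real | | |
| 51 | Large-Scale Evaluation of Open-Set Image Classification Techniques | Large-scale evaluation of open-set image classification techniques, revealing the limits of existing algorithms in generalizing to unknown classes. | OSC | | |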