cs.CV (2024-06-13)

📊 51 papers in total | 🔗 20 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗10) · Pillar 3: Perception & Semantics (15 🔗3) · Pillar 2: RL & Architecture (12 🔗4) · Pillar 6: Video Extraction (5 🔗2) · Pillar 7: Motion Retargeting (2 🔗1) · Pillar 1: Robot Control (2)

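🔬 Pillar 9: Embodied Foundation Models (15 papers)

# | Title | One-line summary | Tags | 🔗
1 | Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models | Proposes Visual Sketchpad, giving multimodal language models a visual sketchpad to improve complex reasoning | multimodal, chain-of-thought
2 | Towards Vision-Language Geo-Foundation Model: A Survey | Survey of research progress and future directions for vision-language geo-foundation models (VLGFMs) | foundation model, multimodal, visual grounding
3 | Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset | Introduces the MARS dataset for multi-agent, multitraversal, multimodal self-driving research | multimodal
4 | MMRel: Benchmarking Relation Understanding in Multi-Modal Large Language Models | Proposes the MMRel benchmark to address relation understanding in multimodal large language models | large language model
5 | Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding | Proposes DASANet, which aligns attribute and spatial-relation features in two branches for more accurate 3D visual grounding | visual grounding
6 | MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs | Proposes MMFakeBench, a mixed-source multimodal misinformation detection benchmark for LVLMs | multimodal
7 | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | VideoGPT+: integrates image and video encoders to improve video understanding | large language model, multimodal
8 | Explore the Limits of Omni-modal Pretraining at Scale | Proposes MiCo, a scalable general-purpose multimodal pretraining framework that significantly improves multimodal understanding | large language model, multimodal
9 | Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs | VideoNIAH: a scalable synthetic evaluator for video MLLMs, addressing the difficulty of evaluating video understanding models | large language model, multimodal
10 | Comparison Visual Instruction Tuning | Proposes the CaD-VI framework and the CaD-Inst dataset to improve LMM performance on image comparison tasks | multimodal, instruction following
11 | INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance | INS-MMBench: the first comprehensive benchmark for multimodal large models in the insurance domain, covering 22 fundamental tasks | large language model, multimodal
12 | Parameter-Efficient Active Learning for Foundational models | Proposes a parameter-efficient active learning framework that improves foundation models on few-shot image classification | foundation model
13 | Language-driven Grasp Detection | Proposes a diffusion-based method for language-driven grasp detection and builds the large-scale dataset Grasp-Anything++ | foundation model
14 | ReMI: A Dataset for Reasoning with Multiple Images | ReMI: a dataset for evaluating large language models on multi-image reasoning | large language model
15 | Zoom and Shift are All You Need | Proposes a zoom-and-shift method for aligning multimodal features, enabling deep fusion of modality information | multimodal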

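🔬 Pillar 3: Perception & Semantics (15 papers)

# | Title | One-line summary | Tags | 🔗
16 | GaussianForest: Hierarchical-Hybrid 3D Gaussian Splatting for Compressed Scene Modeling | Proposes GaussianForest, which compresses scene models with a hierarchical hybrid Gaussian representation, significantly reducing storage requirements | 3D gaussian splatting, gaussian splatting, splatting
17 | Depth Anything V2 | Depth Anything V2: efficient and robust monocular depth estimation through large-scale synthetic data and knowledge distillation | depth estimation, monocular depth, metric depth
18 | Scale-Invariant Monocular Depth Estimation via SSI Depth | Uses SSI depth to achieve scale-invariant monocular depth estimation and improve generalization | depth estimation, monocular depth
19 | ImageNet3D: Towards General-Purpose Object-Level 3D Understanding | Proposes ImageNet3D, a large-scale dataset for general-purpose object-level 3D understanding | open-vocabulary, open vocabulary, large language model
20 | Modeling Ambient Scene Dynamics for Free-view Synthesis | Proposes a free-view synthesis method for dynamic scenes based on modeling periodic motion | 3D gaussian splatting, 3DGS, gaussian splatting
21 | Neural NeRF Compression | Proposes a neural-compression-based method for compressing NeRF models, effectively reducing storage overhead | NeRF, neural radiance field
22 | GGHead: Fast and Generalizable 3D Gaussian Heads | Proposes GGHead, which uses 3D Gaussian heads for fast and generalizable 3D head generation | 3D gaussian splatting, gaussian splatting, splatting
23 | NeRF Director: Revisiting View Selection in Neural Volume Rendering | NeRF Director: revisits view selection in neural volume rendering | NeRF
24 | MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding | MuirBench: a comprehensive benchmark for robust multi-image understanding | scene understanding, multimodal
25 | Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024 | Proposes an improved vast-vocabulary object detection solution for the V3Det Challenge, better handling complex categories and detection boxes | open-vocabulary, open vocabulary
26 | 3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation | Proposes 3D-AVS for auto-vocabulary segmentation of LiDAR point clouds without human intervention | open-vocabulary, open vocabulary
27 | ToSA: Token Selective Attention for Efficient Vision Transformers | Proposes Token Selective Attention (ToSA) for efficient Vision Transformers | depth estimation, monocular depth
28 | Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion | Instruct 4D-to-4D: high-quality, spatiotemporally consistent 4D scene editing with 2D diffusion models | optical flow
29 | WonderWorld: Interactive 3D Scene Generation from a Single Image | WonderWorld: a framework for interactive 3D scene generation from a single image | depth estimation
30 | OpenMaterial: A Large-scale Dataset of Complex Materials for 3D Reconstruction | OpenMaterial: a large-scale dataset of complex materials for 3D reconstruction, improving the realism of reconstructions | neural radiance field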

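🔬 Pillar 2: RL & Architecture (12 papers)

# | Title | One-line summary | Tags | 🔗
31 | Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer | Proposes multiple prior representation learning for self-supervised monocular depth estimation | representation learning, depth estimation, monocular depth
32 | LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living | Proposes LLAVIDAL to improve large language vision models' understanding of activities of daily living | representation learning, human-object interaction, HOI
33 | 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities | 4M-21: an any-to-any vision model for tens of tasks and modalities, extending multimodal capabilities | distillation, 4DHumans, foundation model
34 | QMamba: On First Exploration of Vision Mamba for Image Quality Assessment | QMamba: a first exploration of Vision Mamba for image quality assessment | Mamba, state space model, foundation model
35 | Rethinking Score Distillation as a Bridge Between Image Distributions | Rethinks score distillation as a bridge between image distributions to improve generation quality | distillation, NeRF
36 | Towards Evaluating the Robustness of Visual State Space Models | Evaluates the robustness of visual state space models under various perturbations and compares them with Transformers and CNNs | Mamba, state space model
37 | ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing | ConsistDreamer: high-fidelity scene editing with 3D-consistent 2D diffusion models | dreamer
38 | Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms | Proposes a preference-based reinforcement learning method for aligning vision models, improving the aesthetic quality of image retrieval systems | reinforcement learning, large language model
39 | SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets | SeMOPO: learns high-quality models and policies from low-quality offline visual datasets | reinforcement learning, offline RL, offline reinforcement learning
40 | PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation | Proposes PC-LoRA, which performs model compression and knowledge distillation simultaneously through low-rank adaptation | distillation
41 | DenoiseRep: Denoising Model for Representation Learning | Proposes DenoiseRep, which improves representation learning through joint feature extraction and denoising | representation learning
42 | Preserving Identity with Variational Score for General-purpose 3D Editing | Piva: a general-purpose 3D editing method based on variational score distillation that preserves identity | distillation, NeRF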

🔬 Pillar 6: Video Extraction (5 papers)

# | Title | One-line summary | Tags | 🔗
43 | SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video | Proposes SViTT-Ego, a sparse video-text Transformer for improving egocentric video understanding | egocentric, egocentric vision, foundation model
44 | Introducing HOT3D: An Egocentric Dataset for 3D Hand and Object Tracking | HOT3D: an egocentric vision dataset for 3D hand and object tracking | MANO, egocentric
45 | CARLOR @ Ego4D Step Grounding Challenge: Bayesian temporal-order priors for test time refinement | Proposes Bayesian-VSLNet, which uses Bayesian temporal-order priors for step grounding in Ego4D videos | egocentric, Ego4D
46 | Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos | Proposes the AV-LDM model for generating ambient-aware action sounds from egocentric videos | egocentric, Ego4D
47 | EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding | EgoExo-Fitness: a new dataset for full-body action understanding from egocentric and exocentric views | egocentric

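🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-line summary | Tags | 🔗
48 | MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations | MMScan: builds a multimodal 3D scene dataset with hierarchical language annotations to advance 3D perception research | spatial relationship, visual grounding
49 | SPAN: Unlocking Pyramid Representations for Gigapixel Histopathological Images | SPAN: unlocks pyramid representations for gigapixel histopathological image analysis | spatial relationship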

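🔬 Pillar 1: Robot Control (2 papers)

# | Title | One-line summary | Tags | 🔗
50 | SimGen: Simulator-conditioned Driving Scene Generation | SimGen: a simulator-conditioned driving scene generation framework that improves the quality and diversity of synthetic data | sim-to-real
51 | Large-Scale Evaluation of Open-Set Image Classification Techniques | Large-scale evaluation of open-set image classification techniques, revealing the limitations of existing algorithms in generalizing to unknown classes | OSC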
