cs.CV (2024-06-12)

📊 19 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗1) · Pillar 2: RL & Architecture (5 🔗2) · Pillar 1: Robot Control (3) · Pillar 3: Perception & Semantics (2 🔗1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

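# | Title | Key point | Tags | 🔗
1 | VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks | VisionLLM v2: proposes a generalist multimodal large language model that unifies visual perception, understanding, and generation tasks. | large language model, multimodal
2 | OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text | Proposes OmniCorpus, a large-scale multimodal dataset of ten-billion-scale images interleaved with text, to advance multimodal large language models. | large language model, multimodal
3 | Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | SliME: improves large multimodal model performance on high-resolution images through local compression and a global mixture of experts. | multimodal
4 | LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions | Proposes an LLM-assisted concept discovery method that automatically identifies and explains the functions of neurons in neural networks. | large language model, multimodal
5 | Real2Code: Reconstruct Articulated Objects via Code Generation | Real2Code: reconstructs articulated objects via code generation, pushing past limits on object complexity and real-world scenes. | large language model
6 | GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices | Proposes the GUIOdyssey dataset to improve the performance of cross-app GUI navigation agents on mobile devices. | multimodal
7 | APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation | APSeg: an auto-prompt network for cross-domain few-shot semantic segmentation. | foundation model
8 | Refusal as Silence: Gendered Disparities in Vision-Language Model Responses | Uses gendered identity prompts to reveal gender discrimination in the refusal behavior of vision-language models. | large language model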

🔬 Pillar 2: RL & Architecture (5 papers)

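# | Title | Key point | Tags | 🔗
9 | Pandora: Towards General World Model with Natural Language Actions and Video States | Pandora: a general world model based on natural-language actions and video states. | world model, large language model, foundation model
10 | PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement | PixMamba: a dual-level state space model for underwater image enhancement that improves global consistency. | Mamba, SSM, state space model
11 | MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos | Proposes MMWorld, a benchmark for multi-discipline, multi-faceted evaluation of world models in videos. | world model, multimodal
12 | DistilDoc: Knowledge Distillation for Visually-Rich Document Applications | Proposes DistilDoc, which uses knowledge distillation to improve model efficiency and robustness on visual document understanding tasks. | teacher-student, distillation
13 | UDON: Universal Dynamic Online distillatioN for generic image representations | Proposes UDON, a universal dynamic online distillation method for generic image representations. | distillation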

🔬 Pillar 1: Robot Control (3 papers)

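# | Title | Key point | Tags | 🔗
14 | OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding | OpenObj: proposes open-vocabulary object-level neural radiance fields with fine-grained understanding. | manipulation, NeRF, neural radiance field
15 | Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities | Uses eye-gaze tracking for unsupervised mistake detection in egocentric videos of skilled activities. | manipulation, egocentric
16 | Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata | Proposes hierarchical generative cellular automata for extrapolating the geometry of large-scale outdoor scenes. | sim-to-real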

🔬 Pillar 3: Perception & Semantics (2 papers)

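# | Title | Key point | Tags | 🔗
17 | From Chaos to Clarity: 3DGS in the Dark | Proposes the Raw3DGS framework to address degraded 3DGS reconstruction quality on low-light raw images. | 3D gaussian splatting, 3DGS, gaussian splatting
18 | Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment | Proposes a category-level neural field for 3D reconstruction of partially observed objects in indoor environments. | implicit representation, scene understanding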

🔬 Pillar 6: Video Extraction (1 paper)

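# | Title | Key point | Tags | 🔗
19 | Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model | Proposes a multimodal hierarchical cross-attention model for detecting comic mischief content in online videos. | HuMoR, multimodal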
