cs.CV (2024-06-12)

📊 19 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗1) Pillar 2: RL & Architecture (5 🔗2) Pillar 1: Robot Control (3) Pillar 3: Perception & Semantics (2 🔗1) Pillar 6: Video Extraction (1)

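🔬 Pillar 9: Embodied Foundation Models (8 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
1	VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks	VisionLLM v2: proposes a generalist multimodal large language model that unifies visual perception, understanding, and generation tasks.	large language model multimodal
2	OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text	Proposes OmniCorpus, a large-scale multimodal dataset of 10-billion-level images interleaved with text, to advance multimodal large language models.	large language model multimodal	✅
3	Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models	SliME: improves large multimodal model performance on high-resolution images via local compression and a global mixture of experts.	multimodal
4	LLM-assisted Concept Discovery: Automatically Identifying and Explaining Neuron Functions	Proposes an LLM-assisted concept discovery method that automatically identifies and explains the functions of neural network neurons.	large language model multimodal
5	Real2Code: Reconstruct Articulated Objects via Code Generation	Real2Code: reconstructs articulated objects via code generation, overcoming limits on object complexity and real-world scenes.	large language model
6	GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices	Proposes the GUIOdyssey dataset to improve agent performance on cross-app GUI navigation on mobile devices.	multimodal
7	APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation	APSeg: an auto-prompt network for cross-domain few-shot semantic segmentation.	foundation model
8	Refusal as Silence: Gendered Disparities in Vision-Language Model Responses	Uses gendered identity prompts to reveal gender discrimination in the refusal behavior of vision-language models.	large language model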

🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
9	Pandora: Towards General World Model with Natural Language Actions and Video States	Pandora: a general world model based on natural-language actions and video states.	world model large language model foundation model
10	PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement	PixMamba: a dual-level state space model for underwater image enhancement that improves global consistency.	Mamba SSM state space model	✅
11	MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos	Proposes MMWorld, a benchmark for evaluating multi-discipline, multi-faceted world models in videos.	world model multimodal
12	DistilDoc: Knowledge Distillation for Visually-Rich Document Applications	Proposes DistilDoc, which uses knowledge distillation to improve model efficiency and robustness on visually-rich document understanding tasks.	teacher-student distillation
13	UDON: Universal Dynamic Online distillatioN for generic image representations	Proposes UDON, a universal dynamic online distillation method for generic image representations.	distillation	✅

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
14	OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding	OpenObj: proposes open-vocabulary object-level neural radiance fields with fine-grained understanding.	manipulation NeRF neural radiance field
15	Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities	Uses eye-gaze tracking for unsupervised mistake detection in egocentric videos of skilled activities.	manipulation egocentric
16	Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata	Proposes hierarchical generative cellular automata for extrapolating large-scale outdoor scene geometry.	sim-to-real

🔬 Pillar 3: Perception & Semantics (2 papers)

#	Title	One-Line Summary	Tags	🔗	⭐
17	From Chaos to Clarity: 3DGS in the Dark	Proposes the Raw3DGS framework to address degraded 3DGS reconstruction quality on low-light raw images.	3D gaussian splatting 3DGS gaussian splatting	✅
18	Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment	Proposes a category-level neural field for 3D reconstruction of partially observed objects in indoor environments.	implicit representation scene understanding

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-Line Summary	Tags	🔗	⭐
19	Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model	Proposes a multimodal hierarchical cross-attention model for detecting comic mischief content in online videos.	HuMoR multimodal
