cs.CV (2024-05-30)

📊 46 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 3: Perception & Semantics (12 🔗4) · Pillar 2: RL & Architecture (9 🔗5) · Pillar 1: Robot Control (5) · Pillar 6: Video Extraction (5) · Pillar 4: Generative Motion (2)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

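| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Transfer Attack for Bad and Good: Explain and Boost Adversarial Transferability across Multimodal Large Language Models | Proposes an adversarial transfer attack method to improve the robustness of multimodal large language models | large language model, multimodal | |
| 2 | Temporal Grounding of Activities using Multimodal Large Language Models | Proposes a temporal activity localization method based on multimodal large language models that outperforms existing video LLMs | large language model, multimodal | |
| 3 | Visual Perception by Large Language Model's Weights | Proposes VLoRA to address the computational efficiency problem of multimodal large language models | large language model, multimodal | |
| 4 | LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild | LLMGeo: evaluates the image geolocation ability of large language models in complex scenes | large language model, multimodal | |
| 5 | Instruction-Guided Visual Masking | Proposes Instruction-guided Visual Masking (IVM) to improve multimodal models' understanding of and alignment with complex instructions | multimodal, instruction following, visual grounding | |
| 6 | A Multimodal Dangerous State Recognition and Early Warning System for Elderly with Intermittent Dementia | Proposes a multimodal dangerous-state recognition and early-warning system for elderly people with dementia, addressing the problem of patients getting lost | multimodal | |
| 7 | FMARS: Annotating Remote Sensing Images for Disaster Management using Foundation Models | FMARS: uses foundation models to annotate remote sensing imagery in support of disaster management | foundation model | |
| 8 | Learning Robust Correlation with Foundation Model for Weakly-Supervised Few-Shot Segmentation | Proposes CORENet, which uses foundation models to learn robust correlation for weakly supervised few-shot segmentation | foundation model | |
| 9 | Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals | Uses counterfactual examples to uncover bias in large vision-language models at scale | large language model, multimodal | |
| 10 | AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization | AutoBreach: universal, adaptive jailbreaking of large language models via efficient wordplay-guided optimization | large language model, chain-of-thought | |
| 11 | VAAD: Visual Attention Analysis Dashboard applied to e-Learning | VAAD: a visual attention analysis dashboard for online learning that improves insight into learning behavior | multimodal | |
| 12 | LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning | Proposes a framework in which an LLM assists gradient descent optimization, improving prompt tuning | large language model | |
| 13 | Enhancing Large Vision Language Models with Self-Training on Image Comprehension | Proposes STIC, which enhances large vision-language models through self-training on image comprehension and reduces reliance on labeled data | large language model | |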

🔬 Pillar 3: Perception & Semantics (12 papers)

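| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | GaussianRoom: Improving 3D Gaussian Splatting with SDF Guidance and Monocular Cues for Indoor Scene Reconstruction | GaussianRoom: combines SDF guidance and monocular cues to improve 3D Gaussian Splatting for indoor scene reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 15 | $\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving | Proposes a self-supervised street Gaussian method that decomposes dynamic and static elements of autonomous driving scenes without 3D annotations | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 16 | OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation | Proposes OpenDAS, which improves 2D/3D segmentation performance through open-vocabulary domain adaptation | open-vocabulary, open vocabulary | |
| 17 | RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection | RTGen: generates region-text pairs to improve open-vocabulary object detection | open-vocabulary, open vocabulary | |
| 18 | EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos | Proposes EMAG to address viewpoint dependence and limited generalization in hand motion forecasting from egocentric videos | optical flow, egocentric, Ego4D | |
| 19 | IReNe: Instant Recoloring of Neural Radiance Fields | IReNe: instant recoloring of neural radiance fields, improving editing efficiency and realism | NeRF, neural radiance field, scene reconstruction | |
| 20 | Uncertainty-guided Optimal Transport in Depth Supervised Sparse-View 3D Gaussian | Proposes UGOT, which uses uncertainty-guided optimal transport for sparse-view 3D Gaussian reconstruction | depth estimation, monocular depth, 3D gaussian splatting | |
| 21 | A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D Reconstruction | Proposes a hierarchical Splatter Image method that uses multiple Gaussians to better model occluded regions in single-view 3D reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 22 | View-Consistent Hierarchical 3D Segmentation Using Ultrametric Feature Fields | Proposes a view-consistent hierarchical 3D segmentation method based on ultrametric feature fields, addressing inconsistency across views | NeRF, neural radiance field, foundation model | |
| 23 | TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes | Proposes TetSphere Splatting, which uses tetrahedral meshes for high-quality 3D shape modeling | splatting | |
| 24 | Gated Fields: Learning Scene Reconstruction from Gated Videos | Proposes Gated Fields, which uses active gated video sequences for accurate 3D reconstruction of outdoor scenes | scene reconstruction | |
| 25 | CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets | CLAY: a controllable large-scale generative model for creating high-quality 3D assets | implicit representation | |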

🔬 Pillar 2: RL & Architecture (9 papers)

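| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 26 | NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models | Proposes NoiseBoost to alleviate hallucination in multimodal large language models | reinforcement learning, large language model, multimodal | |
| 27 | PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting | Proposes PLA4D to resolve conflicts between motion and geometry in text-driven 4D rendering | contrastive learning, distillation, gaussian splatting | |
| 28 | Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition | Proposes MM-CDFSL, which addresses cross-domain few-shot learning for egocentric action recognition through multimodal distillation and masked inference | distillation, egocentric, multimodal | |
| 29 | EgoSurgery-Phase: A Dataset of Surgical Phase Recognition from Egocentric Open Surgery Videos | EgoSurgery-Phase: releases the first head-mounted-camera (egocentric) video dataset for open surgery phase recognition and proposes a gaze-guided masked autoencoder | masked autoencoder, MAE, egocentric | |
| 30 | Multi-Label Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining | Proposes multi-label guided soft contrastive learning for efficient pretraining of Earth observation models | contrastive learning, foundation model | |
| 31 | Boost Your Human Image Generation Model via Direct Preference Optimization | Proposes HG-DPO to improve the realism of human image generation models | DPO, direct preference optimization, curriculum learning | |
| 32 | MotionDreamer: Exploring Semantic Video Diffusion features for Zero-Shot 3D Mesh Animation | MotionDreamer: uses semantic features of video diffusion models for zero-shot 3D mesh animation | dreamer | |
| 33 | Estimating Human Poses Across Datasets: A Unified Skeleton and Multi-Teacher Distillation Approach | Proposes a unified skeleton and multi-teacher distillation approach to improve cross-dataset generalization of human pose estimation | distillation | |
| 34 | DeMamba: AI-Generated Video Detection on Million-Scale GenVideo Benchmark | Proposes the DeMamba module and the GenVideo benchmark to improve the generalization and robustness of AI-generated video detection | Mamba | |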

🔬 Pillar 1: Robot Control (5 papers)

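| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 35 | SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation | SAM-E: uses a visual foundation model and sequence imitation for embodied manipulation | manipulation, scene understanding, foundation model | |
| 36 | May the Dance be with You: Dance Generation Framework for Non-Humanoids | Proposes a dance generation framework for non-humanoid agents that learns dance movements from the association between visual rhythm and music | humanoid, reinforcement learning, contrastive learning | |
| 37 | Learning 3D Robotics Perception using Inductive Priors | Uses inductive priors to learn 3D robotics perception, improving generalization and reducing data dependence | sim2real, scene understanding, semantic map | |
| 38 | HINT: Learning Complete Human Neural Representations from Limited Viewpoints | HINT: proposes a NeRF-based method for learning human neural representations, addressing complete human modeling from limited viewpoints | humanoid, NeRF | |
| 39 | ParSEL: Parameterized Shape Editing with Language | ParSEL: proposes a language-based parameterized shape editing method for controllable editing of 3D assets | manipulation | |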

🔬 Pillar 6: Video Extraction (5 papers)

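| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 40 | MotionLLM: Understanding Human Behaviors from Human Motions and Videos | Proposes MotionLLM to address multimodal human behavior understanding | SMPL, human motion, large language model | |
| 41 | SMPLX-Lite: A Realistic and Drivable Avatar Benchmark with Rich Geometry and Texture Annotations | Proposes the SMPLX-Lite dataset and parametric model for driving realistic, controllable full-body avatars | SMPL-X, human motion | |
| 42 | Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera | Proposes a visual question answering dataset based on 360-degree egocentric video to assist people with visual impairments | egocentric | |
| 43 | OmniHands: Towards Robust 4D Hand Mesh Recovery via A Versatile Transformer | OmniHands: robust 4D hand mesh recovery via a versatile Transformer | hand reconstruction | |
| 44 | Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models | Proposes PlausiVL, which uses large video-language models to predict real-world-plausible action sequences | Ego4D | |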

🔬 Pillar 4: Generative Motion (2 papers)

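| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 45 | RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text | RapVerse: proposes a unified framework for generating coherent vocals and whole-body motion from text | motion generation, human motion, multimodal | |
| 46 | Stratified Avatar Generation from Sparse Observations | Proposes a stratified generation method that reconstructs full-body avatars from sparse observations | VQ-VAE, SMPL | |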
