cs.CV (2024-10-07)

📊 30 papers in total | 🔗 10 with code

🎯 Navigate by Interest Area

Pillar 9: Embodied Foundation Models (10 🔗3) · Pillar 3: Perception & Semantics (8 🔗4) · Pillar 2: RL & Architecture (7 🔗1) · Pillar 6: Video Extraction (3 🔗2) · Pillar 1: Robot Control (2)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

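# | Title | One-sentence summary | Tags | 🔗
1 | Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality | Proposes the CausalMM framework, which mitigates hallucinations in multimodal large language models by decoupling attention causality | large language model, multimodal |
2 | ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models | ActiView: proposes a benchmark for evaluating the active perception ability of multimodal large language models | large language model, multimodal |
3 | VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks | VLM2Vec: trains vision-language models for massive multimodal embedding tasks | multimodal, visual grounding |
4 | R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions? | R-Bench: evaluates the robustness of large models to real-world image corruptions | multimodal |
5 | Leveraging Multimodal Diffusion Models to Accelerate Imaging with Side Information | Uses multimodal diffusion models to accelerate imaging with side information, reducing the need for data from expensive modalities | multimodal |
6 | Zero-Shot Vision-and-Language Navigation with Collision Mitigation in Continuous Environment | Proposes VLN-CM, a zero-shot vision-and-language navigation method that addresses collisions in continuous environments | VLN, large language model, foundation model |
7 | Multimodal Fusion Strategies for Mapping Biophysical Landscape Features | Studies multimodal fusion strategies for accurately mapping biophysical landscape features in African savanna ecosystems | multimodal |
8 | Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation | Proposes PhyGenBench to evaluate the physical commonsense of text-to-video models | large language model |
9 | Art2Mus: Bridging Visual Arts and Music through Cross-Modal Generation | Proposes Art2Mus to address the problem of generating music from complex artworks | RT-2 |
10 | Intriguing Properties of Large Language and Vision Models | Reveals intrinsic properties of large language and vision models (LLVMs) and probes their perception capabilities and limitations | large language model |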

🔬 Pillar 3: Perception & Semantics (8 papers)

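# | Title | One-sentence summary | Tags | 🔗
11 | 6DGS: Enhanced Direction-Aware Gaussian Splatting for Volumetric Rendering | Proposes 6DGS, which achieves high-quality real-time volumetric rendering through enhanced direction-aware Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting |
12 | GS-VTON: Controllable 3D Virtual Try-on with Gaussian Splatting | GS-VTON: controllable 3D virtual try-on using Gaussian splatting | 3D gaussian splatting, 3DGS, gaussian splatting |
13 | LiDAR-GS: Real-time LiDAR Re-Simulation using Gaussian Splatting | Proposes LiDAR-GS, which uses Gaussian splatting for real-time, high-fidelity re-simulation of LiDAR scans in urban road scenes | gaussian splatting, splatting, NeRF |
14 | TeX-NeRF: Neural Radiance Fields from Pseudo-TeX Vision | Proposes TeX-NeRF to address 3D reconstruction in low-light environments | NeRF, neural radiance field, scene reconstruction |
15 | OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction | OmniBooth: learns latent control for image synthesis through multimodal instructions | open-vocabulary, open vocabulary, multimodal |
16 | DreamSat: Towards a General 3D Model for Novel View Synthesis of Space Objects | DreamSat: fine-tunes Zero123 XL to obtain a general 3D model for novel view synthesis of space objects | 3D gaussian splatting, gaussian splatting, splatting |
17 | Toward General Object-level Mapping from Sparse Views with 3D Diffusion Priors | Proposes GOM, which uses 3D diffusion priors for general object-level mapping from sparse views | NeRF, neural radiance field |
18 | EmoGene: Audio-Driven Emotional 3D Talking-Head Generation | EmoGene: proposes an audio-driven emotional 3D talking-head generation framework that improves the accuracy of emotional expression | NeRF, neural radiance field |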

🔬 Pillar 2: RL & Architecture (7 papers)

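# | Title | One-sentence summary | Tags | 🔗
19 | DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control | Proposes DartControl, a diffusion-based autoregressive motion model for real-time text-driven motion control | reinforcement learning, text-driven motion, motion synthesis |
20 | PH-Dropout: Practical Epistemic Uncertainty Quantification for View Synthesis | Proposes PH-Dropout for real-time, accurate epistemic uncertainty quantification in NeRF and GS | representation learning, gaussian splatting, splatting |
21 | IGroupSS-Mamba: Interval Group Spatial-Spectral Mamba for Hyperspectral Image Classification | Proposes the IGroupSS-Mamba framework for hyperspectral image classification, improving performance and efficiency | Mamba, state space model, HSI |
22 | Resource-Efficient Multiview Perception: Integrating Semantic Masking with Masked Autoencoders | Proposes a semantics-guided masked autoencoder for efficient multiview perception under resource constraints | masked autoencoder, MAE, scene understanding |
23 | Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization | Proposes Self-Sampling Preference Optimization (SSPO) to improve diffusion model alignment, combining the stability of SFT with the generalization of RL | reinforcement learning, DPO, direct preference optimization |
24 | MetaDD: Boosting Dataset Distillation with Neural Network Architecture-Invariant Generalization | MetaDD: improves dataset distillation through neural network architecture-invariant generalization | distillation |
25 | Improving Object Detection via Local-global Contrastive Learning | Proposes an image translation method based on local-global contrastive learning to improve cross-domain object detection | contrastive learning |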

🔬 Pillar 6: Video Extraction (3 papers)

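# | Title | One-sentence summary | Tags | 🔗
26 | EgoQR: Efficient QR Code Reading in Egocentric Settings | EgoQR: an efficient egocentric QR code reading system for wearable devices | egocentric |
27 | EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos referring to Procedural Texts | EgoOops: proposes a dataset for mistake action detection based on egocentric videos and procedural texts | egocentric |
28 | D-PoSE: Depth as an Intermediate Representation for 3D Human Pose and Shape Estimation | D-PoSE: uses depth maps as an intermediate representation for 3D human pose and shape estimation | SMPL, SMPL-X |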

🔬 Pillar 1: Robot Control (2 papers)

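# | Title | One-sentence summary | Tags | 🔗
29 | Revealing Directions for Text-guided 3D Face Editing | Face Clan: proposes a diffusion-based method for text-guided 3D face editing | manipulation |
30 | CAT: Concept-level backdoor ATtacks for Concept Bottleneck Models | Proposes CAT, a concept-level backdoor attack method for concept bottleneck models | manipulation |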
