cs.CV (2024-07-19)

📊 29 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗1) · Pillar 2: RL & Architecture (8 🔗2) · Pillar 3: Perception & Semantics (6 🔗3) · Pillar 1: Robot Control (2 🔗1) · Pillar 5: Interaction & Reaction (1) · Pillar 7: Motion Retargeting (1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1 🔗1)

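🔬 Pillar 9: Embodied Foundation Models (9 papers)

# | Title | Key point in one sentence | Tags | 🔗
1 | On Pre-training of Multimodal Language Models Customized for Chart Understanding | Proposes CHOPINLLM, a multimodal large language model customized to improve chart understanding. | large language model, multimodal
2 | Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding | Proposes a token-level correlation-guided compression method to improve the efficiency of multimodal document understanding. | large language model, multimodal
3 | PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding | PD-APE: a parallel decoding framework with adaptive position encoding for 3D visual grounding. | visual grounding
4 | Patch-based Intuitive Multimodal Prototypes Network (PIMPNet) for Alzheimer's Disease classification | PIMPNet: a patch-based multimodal prototype network for Alzheimer's disease classification. | multimodal
5 | Visual Text Generation in the Wild | Proposes SceneVTG, a visual text generator that produces high-quality, practical text images in complex scenes. | large language model, multimodal
6 | Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance | Proposes Semantic-CC, which uses foundational knowledge and semantic guidance to improve remote sensing image change captioning. | large language model, foundation model
7 | EVLM: An Efficient Vision-Language Model for Visual Understanding | Proposes EVLM, an efficient vision-language model for improved visual understanding. | large language model
8 | Seismic Fault SAM: Adapting SAM with Lightweight Modules and 2.5D Strategy for Fault Detection | Seismic Fault SAM: adapts SAM with lightweight modules and a 2.5D strategy for seismic fault detection. | foundation model
9 | Img2CAD: Reverse Engineering 3D CAD Models from Images through VLM-Assisted Conditional Factorization | Proposes Img2CAD, which reverse-engineers 3D CAD models from images through VLM-assisted conditional factorization. | foundation model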

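🔬 Pillar 2: RL & Architecture (8 papers)

# | Title | Key point in one sentence | Tags | 🔗
10 | Multi-modal Relation Distillation for Unified 3D Representation Learning | Proposes the Multi-modal Relation Distillation (MRD) framework, improving zero-shot classification and cross-modal retrieval in 3D representation learning. | representation learning, distillation
11 | Contrastive Learning with Counterfactual Explanations for Radiology Report Generation | Proposes CoFE, a contrastive learning framework based on counterfactual explanations, to improve the quality of radiology report generation. | contrastive learning, large language model
12 | Dataset Distillation by Automatic Training Trajectories | Proposes ATT, which uses adaptive training trajectories to address the accumulated mismatch problem in dataset distillation. | distillation, AMP
13 | Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective | Proposes BOLD-DI, which decouples static and dynamic semantics to improve self-supervised video representation learning. | representation learning, contrastive learning
14 | PlacidDreamer: Advancing Harmony in Text-to-3D Generation | PlacidDreamer: a harmonious text-to-3D generation framework that addresses conflicting generation directions and over-saturation. | dreamer, distillation
15 | Improving classification of road surface conditions via road area extraction and contrastive learning | Proposes a road surface condition classification method based on road area extraction and contrastive learning, improving classification performance while reducing computational cost. | contrastive learning
16 | An Attention-based Representation Distillation Baseline for Multi-Label Continual Learning | Proposes SCAD, an attention-based distillation method that addresses catastrophic forgetting in multi-label continual learning. | distillation
17 | Semi-supervised reference-based sketch extraction using a contrastive learning framework | Proposes a semi-supervised, reference-based sketch extraction method built on contrastive learning, addressing the difficulty of style-transfer sketch generation. | contrastive learning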

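🔬 Pillar 3: Perception & Semantics (6 papers)

# | Title | Key point in one sentence | Tags | 🔗
18 | OpenSU3D: Open World 3D Scene Understanding using Foundation Models | OpenSU3D: builds open-world 3D scene understanding using foundation models. | scene understanding, large language model, foundation model
19 | A Benchmark for Gaussian Splatting Compression and Quality Assessment Study | Proposes GGSC, a graph-based GS compression method, and builds GSQA, a GS quality assessment dataset. | gaussian splatting, splatting
20 | Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation | Mono-ViFI: a unified learning framework for self-supervised monocular depth estimation. | depth estimation, monocular depth
21 | GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation | GaussianBeV: proposes a new BeV segmentation method based on a 3D Gaussian representation, setting a new state of the art on the nuScenes dataset. | gaussian splatting, splatting, scene understanding
22 | Bidirectional Regression for Monocular 6DoF Head Pose Estimation and Reference System Alignment | Proposes the TRGv2 network, which improves monocular 6DoF head pose estimation accuracy through bidirectional regression and reference system alignment. | depth estimation
23 | MC-PanDA: Mask Confidence for Panoptic Domain Adaptation | MC-PanDA uses mask transformer confidence for panoptic domain adaptation, significantly improving segmentation performance. | scene understanding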

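🔬 Pillar 1: Robot Control (2 papers)

# | Title | Key point in one sentence | Tags | 🔗
24 | Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation | Proposes a decomposed vector-quantized variational autoencoder for human grasp generation. | manipulation, VQ-VAE
25 | How to Blend Concepts in Diffusion Models | Explores concept blending in diffusion models, generating images through latent-space manipulation of text prompts. | manipulation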

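🔬 Pillar 5: Interaction & Reaction (1 paper)

# | Title | Key point in one sentence | Tags | 🔗
26 | Kinematics-based 3D Human-Object Interaction Reconstruction from Single View | Proposes a kinematics-based method for single-view 3D human-object interaction reconstruction, addressing pose estimation under occlusion. | human-object interaction, HOI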

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | Key point in one sentence | Tags | 🔗
27 | T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation | Proposes T2V-CompBench for comprehensive evaluation of compositional text-to-video generation models. | spatial relationship, large language model, multimodal

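🔬 Pillar 4: Generative Motion (1 paper)

# | Title | Key point in one sentence | Tags | 🔗
28 | M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models | Proposes M2D2M, which uses discrete diffusion models to generate multi-motion human motion sequences from text. | motion generation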

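🔬 Pillar 6: Video Extraction (1 paper)

# | Title | Key point in one sentence | Tags | 🔗
29 | Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking | TCBTrack: uses temporal correlation and lightweight embeddings to realize a second generation of JDE-based real-time multi-object tracking. | feature matching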
