cs.CV (2024-07-19)

📊 29 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗1)
Pillar 2: RL & Architecture (8 🔗2)
Pillar 3: Perception & Semantics (6 🔗3)
Pillar 1: Robot Control (2 🔗1)
Pillar 5: Interaction & Reaction (1)
Pillar 7: Motion Retargeting (1)
Pillar 4: Generative Motion (1)
Pillar 6: Video Extraction (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	On Pre-training of Multimodal Language Models Customized for Chart Understanding	Proposes CHOPINLLM, a multimodal large language model customized to improve chart understanding.	large language model multimodal
2	Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding	Proposes a token-level correlation-guided compression method to make multimodal document understanding more efficient.	large language model multimodal
3	PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding	PD-APE: a parallel decoding framework with adaptive position encoding for 3D visual grounding.	visual grounding
4	Patch-based Intuitive Multimodal Prototypes Network (PIMPNet) for Alzheimer's Disease classification	PIMPNet: a patch-based multimodal prototype network for Alzheimer's disease classification.	multimodal
5	Visual Text Generation in the Wild	Proposes SceneVTG, a visual text generator that produces high-quality, practical text images in complex scenes.	large language model multimodal
6	Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance	Proposes Semantic-CC, which uses foundational knowledge and semantic guidance to improve remote sensing image change captioning.	large language model foundation model
7	EVLM: An Efficient Vision-Language Model for Visual Understanding	Proposes EVLM, an efficient vision-language model for improved visual understanding.	large language model
8	Seismic Fault SAM: Adapting SAM with Lightweight Modules and 2.5D Strategy for Fault Detection	Seismic Fault SAM: adapts SAM with lightweight modules and a 2.5D strategy for seismic fault detection.	foundation model
9	Img2CAD: Reverse Engineering 3D CAD Models from Images through VLM-Assisted Conditional Factorization	Proposes Img2CAD, which reverse engineers 3D CAD models from images via VLM-assisted conditional factorization.	foundation model	✅

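🔬 Pillar 2: RL & Architecture (8 papers)

#	Title	One-line summary	Tags	🔗	⭐
10	Multi-modal Relation Distillation for Unified 3D Representation Learning	Proposes the Multi-modal Relation Distillation (MRD) framework, which improves zero-shot classification and cross-modal retrieval in 3D representation learning.	representation learning distillation
11	Contrastive Learning with Counterfactual Explanations for Radiology Report Generation	Proposes CoFE, a contrastive learning framework based on counterfactual explanations, to improve the quality of radiology report generation.	contrastive learning large language model
12	Dataset Distillation by Automatic Training Trajectories	Proposes ATT, which uses automatic training trajectories to address the accumulated mismatching problem in dataset distillation.	distillation AMP
13	Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective	Proposes BOLD-DI, which decouples static and dynamic semantics to improve self-supervised video representation learning.	representation learning contrastive learning
14	PlacidDreamer: Advancing Harmony in Text-to-3D Generation	PlacidDreamer: a harmonious text-to-3D generation framework that addresses conflicting generation directions and over-saturation.	dreamer distillation	✅
15	Improving classification of road surface conditions via road area extraction and contrastive learning	Proposes a road surface condition classification method based on road area extraction and contrastive learning, improving classification performance while reducing computational cost.	contrastive learning
16	An Attention-based Representation Distillation Baseline for Multi-Label Continual Learning	Proposes SCAD, an attention-based distillation method that addresses catastrophic forgetting in multi-label continual learning.	distillation	✅
17	Semi-supervised reference-based sketch extraction using a contrastive learning framework	Proposes a semi-supervised, reference-based sketch extraction method built on contrastive learning, targeting sketch generation via style transfer.	contrastive learning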

🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
18	OpenSU3D: Open World 3D Scene Understanding using Foundation Models	OpenSU3D: open-world 3D scene understanding built on foundation models.	scene understanding large language model foundation model
19	A Benchmark for Gaussian Splatting Compression and Quality Assessment Study	Proposes GGSC, a graph-based Gaussian Splatting (GS) compression method, and builds GSQA, a GS quality assessment dataset.	gaussian splatting splatting	✅
20	Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation	Mono-ViFI: a unified learning framework for self-supervised monocular depth estimation.	depth estimation monocular depth	✅
21	GaussianBeV: 3D Gaussian Representation meets Perception Models for BeV Segmentation	GaussianBeV: a new BeV segmentation method based on a 3D Gaussian representation that sets a new state of the art on nuScenes.	gaussian splatting splatting scene understanding
22	Bidirectional Regression for Monocular 6DoF Head Pose Estimation and Reference System Alignment	Proposes the TRGv2 network, which improves monocular 6DoF head pose estimation accuracy through bidirectional regression and reference system alignment.	depth estimation
23	MC-PanDA: Mask Confidence for Panoptic Domain Adaptation	MC-PanDA uses mask transformer confidence for panoptic domain adaptation, significantly improving segmentation performance.	scene understanding	✅

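🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
24	Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation	Proposes a decomposed vector-quantized variational autoencoder for human grasp generation.	manipulation VQ-VAE	✅
25	How to Blend Concepts in Diffusion Models	Explores concept blending in diffusion models, generating images through latent-space operations on text prompts.	manipulation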

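🔬 Pillar 5: Interaction & Reaction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
26	Kinematics-based 3D Human-Object Interaction Reconstruction from Single View	Proposes a kinematics-based method for single-view 3D human-object interaction reconstruction, addressing pose estimation under occlusion.	human-object interaction HOI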

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
27	T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation	Proposes T2V-CompBench, a benchmark for comprehensively evaluating compositional text-to-video generation models.	spatial relationship large language model multimodal

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
28	M2D2M: Multi-Motion Generation from Text with Discrete Diffusion Models	Proposes M2D2M, which uses discrete diffusion models to generate multi-motion human motion sequences from text.	motion generation

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
29	Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking	TCBTrack: uses temporal correlation and lightweight embeddings for second-generation JDE-based real-time multi-object tracking.	feature matching	✅
