cs.CV (2024-10-21)

📊 38 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗4) | Pillar 2: RL & Architecture (14 🔗4) | Pillar 3: Perception & Semantics (5 🔗2) | Pillar 7: Motion Retargeting (2 🔗1) | Pillar 6: Video Extraction (1 🔗1) | Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (15 papers)

#	Title	Key Point	Tags	🔗	⭐
1	Griffon-G: Bridging Vision-Language and Vision-Centric Tasks via Large Multimodal Models	Proposes Griffon-G, a large multimodal model that unifies vision-language and vision-centric tasks	large language model multimodal instruction following
2	PlaneSAM: Multimodal Plane Instance Segmentation Using the Segment Anything Model	PlaneSAM: multimodal plane instance segmentation using the Segment Anything Model	multimodal zero-shot transfer
3	Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance	Mini-InternVL: a flexibly transferable multimodal model that reaches 90% of the performance with 5% of the parameters	large language model multimodal	✅
4	Domain-Adaptive Pre-training of Self-Supervised Foundation Models for Medical Image Classification in Gastrointestinal Endoscopy	Proposes a domain-adaptive pre-training method to improve medical image classification in gastrointestinal endoscopy	foundation model
5	Benchmarking Pathology Foundation Models: Adaptation Strategies and Scenarios	Benchmarks pathology foundation models across adaptation strategies and application scenarios	foundation model	✅
6	Foundation Models for Slide-level Cancer Subtyping in Digital Pathology	Uses domain-pretrained foundation models to improve slide-level cancer subtyping in digital pathology	foundation model
7	Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding	Proposes a joint top-down and bottom-up framework to improve 3D visual grounding	visual grounding
8	Multimodal Learning for Embryo Viability Prediction in Clinical IVF	Proposes a multimodal learning model for embryo viability prediction in clinical IVF	multimodal
9	Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining	Proposes AITQE, an adaptive image-text quality enhancer, to improve the quality of MLLM pretraining data	large language model multimodal	✅
10	SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree	Proposes SAM2Long, which uses a training-free memory tree to improve SAM 2 on long video segmentation	foundation model	✅
11	Mitigating Object Hallucination via Concentric Causal Attention	Proposes Concentric Causal Attention (CCA) to mitigate object hallucination in large vision-language models	multimodal
12	Reducing Hallucinations in Vision-Language Models via Latent Space Steering	Proposes VTI, which reduces hallucinations in vision-language models via latent space steering	large language model
13	Improving Instance Optimization in Deformable Image Registration with Gradient Projection	Proposes a gradient-projection instance optimization method for deformable image registration, improving registration accuracy and stability	foundation model
14	When LLMs Learn to be Students: The SOEI Framework for Modeling and Evaluating Virtual Student Agents in Educational Interaction	Proposes the SOEI framework for building and evaluating LLM-based virtual student agents in educational interaction	large language model
15	Deep Learning and Machine Learning -- Object Detection and Semantic Segmentation: From Theory to Applications	Surveys object detection and semantic segmentation, combining theory with applications and exploring frontier deep learning techniques	large language model

🔬 Pillar 2: RL & Architecture (14 papers)

#	Title	Key Point	Tags	🔗	⭐
16	CL-HOI: Cross-Level Human-Object Interaction Distillation from Vision Large Language Models	Proposes the CL-HOI framework, which distills from vision large language models to enable annotation-free human-object interaction detection	distillation human-object interaction HOI
17	LLaVA-KD: A Framework of Distilling Multimodal Large Language Models	LLaVA-KD: a framework for distilling multimodal large language models	distillation large language model multimodal	✅
18	Few-shot target-driven instance detection based on open-vocabulary object detection models	Proposes a lightweight method that uses open-vocabulary object detection models for few-shot target-driven instance detection	world model open-vocabulary open vocabulary
19	START: A Generalized State Space Model with Saliency-Driven Token-Aware Transformation	Proposes START, a state space model with saliency-driven token-aware transformation, to improve domain generalization	Mamba SSM state space model	✅
20	MBPU: A Plug-and-Play State Space Model for Point Cloud Upsamping with Fast Point Rendering	Proposes MBPU, a Mamba-based network for large-scale point cloud upsampling with fewer artifacts	Mamba state space model
21	Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identification	Proposes SSSC-TransReID to strengthen Transformer feature representations for person re-identification under occlusion	representation learning contrastive learning
22	Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions	Joker: 3D head synthesis with extreme facial expressions based on a conditional diffusion model	distillation NeRF neural radiance field
23	YOLO11 and Vision Transformers based 3D Pose Estimation of Immature Green Fruits in Commercial Apple Orchards for Robotic Thinning	Proposes a YOLO11 and Vision Transformer based 3D pose estimation method for immature apples, for use in robotic thinning	MAE depth estimation Depth Anything
24	LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset	Introduces LMHaze, a large-scale real haze dataset, and designs a MoE-Mamba model to improve image dehazing	Mamba multimodal
25	Robust Visual Representation Learning with Multi-modal Prior Knowledge for Image Classification Under Distribution Shift	Proposes KGV, a knowledge-guided visual representation learning method, to improve the generalization of image classification under distribution shift	representation learning
26	Learning from Neighbors: Category Extrapolation for Long-Tail Learning	Proposes a neighbor-based category extrapolation method to address poor tail-class generalization in long-tail learning	representation learning large language model
27	Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?	Proposes a dataset distillation method with intra-class supervision that substantially compresses soft-label size and improves performance	distillation	✅
28	Contrastive Learning with Auxiliary User Detection for Identifying Activities	Proposes the CLAUDIA framework, which uses contrastive learning with auxiliary user detection to improve user- and context-aware human activity recognition	contrastive learning
29	TIPS: Text-Image Pretraining with Spatial awareness	Proposes TIPS to address insufficient spatial awareness in image-text representation learning	representation learning depth estimation	✅

🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	Key Point	Tags	🔗	⭐
30	3DGS-Enhancer: Enhancing Unbounded 3D Gaussian Splatting with View-consistent 2D Diffusion Priors	3DGS-Enhancer: enhances unbounded 3D Gaussian splatting with view-consistent 2D diffusion priors	3D gaussian splatting 3DGS gaussian splatting	✅
31	Fully Explicit Dynamic Gaussian Splatting	Proposes explicit 4D Gaussian splatting (Ex4DGS) for fast, high-quality rendering of dynamic scenes	3D gaussian splatting gaussian splatting splatting
32	FrugalNeRF: Fast Convergence for Extreme Few-shot Novel View Synthesis without Learned Priors	FrugalNeRF: fast convergence for extreme few-shot novel view synthesis without learned priors	NeRF neural radiance field scene reconstruction
33	Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly	Proposes a deep prior assembly framework for zero-shot scene reconstruction from a single image	scene reconstruction	✅
34	Focus on BEV: Self-calibrated Cycle View Transformation for Monocular Birds-Eye-View Segmentation	FocusBEV: a self-calibrated cycle view transformation method for monocular BEV segmentation	semantic map spatiotemporal

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	Key Point	Tags	🔗	⭐
35	MvDrag3D: Drag-based Creative 3D Editing via Multi-view Generation-Reconstruction Priors	MvDrag3D: drag-based creative 3D editing via multi-view generation-reconstruction priors	latent optimization
36	Revisiting Deep Feature Reconstruction for Logical and Structural Industrial Anomaly Detection	Proposes ULSAD, which combines deep feature reconstruction with an attention mechanism for logical and structural industrial anomaly detection	spatial relationship	✅

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	Key Point	Tags	🔗	⭐
37	ARTS: Semi-Analytical Regressor using Disentangled Skeletal Representations for Human Mesh Recovery from Videos	ARTS: a semi-analytical regressor using disentangled skeletal representations for human mesh recovery from videos	human mesh recovery human motion	✅

🔬 Pillar 1: Robot Control (1 paper)

#	Title	Key Point	Tags	🔗	⭐
38	DeepIcon: A Hierarchical Network for Layer-wise Icon Vectorization	Proposes DeepIcon for layer-wise vectorization of raster images into variable-length icon vector graphics	manipulation
