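cs.CV (2024-10-02)

📊 16 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (7 🔗2) · Pillar 9: Embodied Foundation Models (7 🔗2) · Pillar 2: RL Algorithms and Architecture (RL & Architecture) (1) · Pillar 6: Video Extraction and Matching (Video Extraction) (1)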

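🔬 Pillar 3: Spatial Perception and Semantics (Perception & Semantics) (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	3DGS-DET: Empower 3D Gaussian Splatting with Boundary Guidance and Box-Focused Sampling for 3D Object Detection	3DGS-DET: enhances 3D Gaussian Splatting with boundary guidance and box-focused sampling for 3D object detection.	3D gaussian splatting 3DGS gaussian splatting
2	Multi-view-regulated Gaussian Splatting for Novel View Synthesis	Proposes a multi-view-regulated Gaussian splatting method that improves novel view synthesis quality and geometric accuracy.	3D gaussian splatting 3DGS gaussian splatting
3	Depth Pro: Sharp Monocular Metric Depth in Less Than a Second	Depth Pro: generates high-accuracy monocular metric depth maps in under a second.	depth estimation monocular depth metric depth	✅
4	Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking	Proposes Open3DTrack to address open-vocabulary 3D multi-object tracking, improving environment perception for autonomous driving.	open-vocabulary open vocabulary
5	SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images	Proposes SegEarth-OV, enabling training-free open-vocabulary segmentation of remote sensing images.	open-vocabulary open vocabulary	✅
6	EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis	Proposes EVER, an exact volumetric ellipsoid rendering method for real-time view synthesis.	3D gaussian splatting 3DGS gaussian splatting
7	Neural Eulerian Scene Flow Fields	EulerFlow: estimates scene flow as a continuous space-time ODE using a neural prior.	scene flow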

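🔬 Pillar 9: Embodied Foundation Models (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
8	UlcerGPT: A Multimodal Approach Leveraging Large Language and Vision Models for Diabetic Foot Ulcer Image Transcription	UlcerGPT: a multimodal approach that uses large language and vision models to transcribe diabetic foot ulcer images.	large language model multimodal
9	Semi-Supervised Fine-Tuning of Vision Foundation Models with Content-Style Decomposition	Proposes a semi-supervised fine-tuning method based on content-style decomposition, improving vision foundation model performance when labeled data is scarce.	foundation model
10	Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval	Quantifies the gap between translated text and native perception in multimodal, multilingual retrieval, and proposes data augmentation strategies.	multimodal
11	EMMA: Efficient Visual Alignment in Multi-Modal LLMs	Proposes EMMA, an efficient visual alignment method for multimodal LLMs that improves task adaptability.	large language model foundation model	✅
12	Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks	Leopard: a vision-language model for text-rich multi-image tasks.	large language model multimodal	✅
13	Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description	Emo3D: introduces a benchmark dataset and a new evaluation metric for 3D facial expression generation.	large language model
14	Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark	Proposes the RADAR framework for robust industrial anomaly detection when modalities are missing.	multimodal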

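🔬 Pillar 2: RL Algorithms and Architecture (RL & Architecture) (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
15	LaGeM: A Large Geometry Model for 3D Representation Learning and Diffusion	LaGeM: proposes a large geometry model for 3D representation learning and diffusion, addressing the challenges of large-scale datasets and generative modeling.	representation learning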

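🔬 Pillar 6: Video Extraction and Matching (Video Extraction) (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
16	Enhancing Screen Time Identification in Children with a Multi-View Vision Language Model and Screen Time Tracker	Proposes a framework for identifying children's screen time based on a multi-view vision-language model, improving monitoring accuracy in natural settings.	egocentric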
