cs.CV (2024-12-31)

📊 16 papers in total | 🔗 4 with code

🎯 Navigation by Area of Interest

Pillar 3: Spatial Perception & Semantics (6 papers, 🔗 2 with code)
Pillar 9: Embodied Foundation Models (6 papers, 🔗 1 with code)
Pillar 1: Robot Control (2 papers, 🔗 1 with code)
Pillar 8: Physics-based Animation (1 paper)
Pillar 2: RL Algorithms & Architecture (1 paper)

🔬 Pillar 3: Spatial Perception & Semantics (6 papers)

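# | Title | One-line summary | Tags | 🔗
1 | OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models | Proposes the OV-HHIR framework, which uses large language models for open-vocabulary human-human interaction recognition and is suited to public safety surveillance. | open-vocabulary, open vocabulary, large language model |
2 | SG-Splatting: Accelerating 3D Gaussian Splatting with Spherical Gaussians | SG-Splatting: accelerates 3D Gaussian Splatting with spherical Gaussians, improving both rendering speed and quality. | 3D gaussian splatting, gaussian splatting, splatting |
3 | OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies | Proposes OVGaussian to address open-vocabulary semantic segmentation of 3D Gaussians. | 3DGS, scene understanding, semantic map |
4 | PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM | PanoSLAM: the first panoptic 3D scene reconstruction system built on Gaussian SLAM. | 3D gaussian splatting, gaussian splatting, splatting |
5 | Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting | Proposes GBM, a method for reconstructing a building's 3D mesh using Google Earth and Gaussian Splatting. | gaussian splatting, splatting |
6 | STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes | STORM: a spatio-temporal reconstruction model for large-scale outdoor scenes that enables efficient dynamic scene reconstruction. | scene reconstruction, scene understanding, scene flow |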

🔬 Pillar 9: Embodied Foundation Models (6 papers)

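# | Title | One-line summary | Tags | 🔗
7 | OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | OCRBench v2: an improved benchmark for evaluating multimodal models on visual text localization and reasoning. | multimodal |
8 | MLLM-as-a-Judge for Image Safety without Human Labeling | Proposes an MLLM-based image safety judgment method that needs no human labeling, targeting the safety of AI-generated content (AIGC). | large language model, multimodal, chain-of-thought |
9 | VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling | VideoChat-Flash: enables long-context video modeling through hierarchical compression, significantly reducing computational cost. | large language model, multimodal |
10 | VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM | Proposes VideoRefer Suite to strengthen the spatial-temporal object understanding of Video LLMs. | large language model |
11 | CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | Proposes CaReBench, a benchmark for fine-grained video captioning and retrieval that also evaluates the spatiotemporal biases of video-language models. | multimodal |
12 | CRRG-CLIP: Automatic Generation of Chest Radiology Reports and Classification of Chest Radiographs | Proposes the CRRG-CLIP model for automatic chest X-ray report generation and disease classification. | multimodal |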

🔬 Pillar 1: Robot Control (2 papers)

# | Title | One-line summary | Tags | 🔗
13 | Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding | Embodied VideoAgent: uses egocentric videos and embodied sensors for dynamic scene understanding. | manipulation, scene understanding, egocentric |
14 | SoundBrush: Sound as a Brush for Visual Scene Editing | SoundBrush: a model that uses sound as a brush to edit visual scenes. | manipulation |

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-line summary | Tags | 🔗
15 | Online Video Understanding: OVBench and VideoChat-Online | Proposes VideoChat-Online for online video understanding; it outperforms state-of-the-art models on OVBench. | spatiotemporal, large language model, multimodal |

🔬 Pillar 2: RL Algorithms & Architecture (1 paper)

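# | Title | One-line summary | Tags | 🔗
16 | A Novel Convolution and Attention Mechanism-based Model for 6D Object Pose Estimation | PoseLecTr: a 6D object pose estimation method that combines Legendre convolution with an attention mechanism. | distillation, spatial relationship |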
