cs.CV (2024-12-31)
📊 16 papers in total | 🔗 4 with code
🎯 Interest Area Navigation
Pillar 3: Spatial Perception & Semantics (6 🔗2)
Pillar 9: Embodied Foundation Models (6 🔗1)
Pillar 1: Robot Control (2 🔗1)
Pillar 8: Physics-based Animation (1)
Pillar 2: RL Algorithms & Architecture (1)
🔬 Pillar 3: Spatial Perception & Semantics (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
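| 1 | OV-HHIR: Open Vocabulary Human Interaction Recognition Using Cross-modal Integration of Large Language Models | Proposes the OV-HHIR framework, which uses large language models for open-vocabulary human-to-human interaction recognition, with applications in public-safety surveillance | open-vocabulary, open vocabulary, large language model | | |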
| 2 | SG-Splatting: Accelerating 3D Gaussian Splatting with Spherical Gaussians | SG-Splatting: accelerates 3D Gaussian Splatting with spherical Gaussians, improving rendering speed and quality | 3D gaussian splatting, gaussian splatting, splatting | | |
| 3 | OVGaussian: Generalizable 3D Gaussian Segmentation with Open Vocabularies | Proposes OVGaussian to address open-vocabulary semantic segmentation of 3D Gaussians | 3DGS, scene understanding, semantic map | ✅ | |
| 4 | PanoSLAM: Panoptic 3D Scene Reconstruction via Gaussian SLAM | PanoSLAM: presented as the first panoptic 3D scene reconstruction system based on Gaussian SLAM | 3D gaussian splatting, gaussian splatting, splatting | ✅ | |
| 5 | Gaussian Building Mesh (GBM): Extract a Building's 3D Mesh with Google Earth and Gaussian Splatting | Proposes GBM, a method for reconstructing a building's 3D mesh using Google Earth and Gaussian splatting | gaussian splatting, splatting | | |
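| 6 | STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes | STORM: a spatio-temporal reconstruction model for large-scale outdoor scenes that enables efficient dynamic scene reconstruction | scene reconstruction, scene understanding, scene flow | | |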
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning | OCRBench v2: an improved benchmark for evaluating multimodal models on visual text localization and reasoning | multimodal | ✅ | |
| 8 | MLLM-as-a-Judge for Image Safety without Human Labeling | Proposes an MLLM-based image safety judgment method that requires no human labeling, targeting AIGC content safety | large language model, multimodal, chain-of-thought | | |
| 9 | VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling | VideoChat-Flash: long-context video modeling via hierarchical compression, significantly reducing computational cost | large language model, multimodal | | |
| 10 | VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM | Proposes VideoRefer Suite to strengthen Video LLMs' spatial-temporal object understanding | large language model | | |
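| 11 | CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval | Proposes the CaReBench benchmark for fine-grained video captioning and retrieval, and evaluates the spatial and temporal biases of video-language models | multimodal | | |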
| 12 | CRRG-CLIP: Automatic Generation of Chest Radiology Reports and Classification of Chest Radiographs | Proposes the CRRG-CLIP model for automatic chest X-ray report generation and disease classification | multimodal | | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding | Embodied VideoAgent: uses egocentric video and embodied sensors for dynamic scene understanding | manipulation, scene understanding, egocentric | | |
| 14 | SoundBrush: Sound as a Brush for Visual Scene Editing | SoundBrush: proposes a model that uses sound as a brush to edit visual scenes | manipulation | ✅ | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | Online Video Understanding: OVBench and VideoChat-Online | Proposes VideoChat-Online for online video understanding, reported to outperform SOTA models on OVBench | spatiotemporal, large language model, multimodal | | |
🔬 Pillar 2: RL Algorithms & Architecture (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | A Novel Convolution and Attention Mechanism-based Model for 6D Object Pose Estimation | PoseLecTr: a 6D object pose estimation method combining Legendre convolution with an attention mechanism | distillation, spatial relationship | | |