cs.CV (2024-12-14)
📊 16 papers in total | 🔗 1 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (7)
Pillar 3: Perception & Semantics (3)
Pillar 2: RL & Architecture (3 🔗1)
Pillar 4: Generative Motion (1)
Pillar 6: Video Extraction (1)
Pillar 7: Motion Retargeting (1)
🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
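| 1 | Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning | Proposes a fine-tuning-free, attention-driven GUI grounding method that uses pretrained multimodal large language models to identify GUI components accurately. | large language model multimodal | ||
| 2 | OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving | OmniHD-Scenes: a next-generation multimodal dataset for autonomous driving that supports low-cost sensor solutions. | multimodal | ||
| 3 | MEATRD: Multimodal Anomalous Tissue Region Detection Enhanced with Spatial Transcriptomics | MEATRD: multimodal anomalous tissue region detection enhanced with spatial transcriptomics. | multimodal | ||
| 4 | Low-Biased General Annotated Dataset Generation | Proposes the lbGen framework, which generates a low-biased general dataset to improve generalization on downstream vision tasks. | foundation model multimodal | ||
| 5 | Optimizing Vision-Language Interactions Through Decoder-Only Models | Proposes MUDAIF, a decoder-only vision-language model that improves efficiency and cross-modal understanding. | multimodal | ||
| 6 | Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives | Proposes the CTRM module to enhance LVLMs by modeling causal and temporal relations in video narratives, improving video description quality. | multimodal | ||
| 7 | Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm | Proposes MI Grounding, a framework based on the image prompt paradigm, for open-set object detection and segmentation. | large language model |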
🔬 Pillar 3: Perception & Semantics (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
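| 8 | DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting | DCSEG: a decoupled 3D open-set segmentation method using Gaussian splatting. | gaussian splatting splatting NeRF | ||
| 9 | CFSSeg: Closed-Form Solution for Class-Incremental Semantic Segmentation of 2D Images and 3D Point Clouds | Proposes CFSSeg, which uses a closed-form solution for efficient class-incremental semantic segmentation of 2D images and 3D point clouds. | scene understanding | ||
| 10 | MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance | Proposes the MAL framework, which enhances xLSTM vision performance through cluster masking and multi-task pretraining. | depth estimation |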
🔬 Pillar 2: RL & Architecture (3 papers)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
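| 11 | Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation | Proposes a rebalanced vision-language retrieval method with structure-aware distillation to address modality imbalance. | distillation geometric consistency | ||
| 12 | Video Representation Learning with Joint-Embedding Predictive Architectures | Proposes VJ-VCR, a self-supervised video representation learning method based on joint-embedding predictive architectures that improves understanding of video dynamics. | representation learning | ||
| 13 | MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt | MambaPro: multi-modal object re-identification using Mamba aggregation and synergistic prompts. | Mamba | ✅ |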
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer | SoftVQ-VAE: an efficient 1-dimensional continuous image tokenizer that significantly speeds up generative model inference. | VQ-VAE |
🔬 Pillar 6: Video Extraction (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | Detecting Activities of Daily Living in Egocentric Video to Contextualize Hand Use at Home in Outpatient Neurorehabilitation Settings | Proposes an object-interaction-based activity recognition method for understanding patients' hand use at home in neurorehabilitation. | egocentric |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-line summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | Enhancing Road Crack Detection Accuracy with BsS-YOLO: Optimizing Feature Fusion and Attention Mechanisms | BsS-YOLO improves road crack detection accuracy by optimizing feature fusion and attention mechanisms. | structure preservation |