cs.CV (2024-12-14)

📊 16 papers in total | 🔗 1 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (7) · Pillar 3: Perception & Semantics (3) · Pillar 2: RL & Architecture (3, 🔗 1) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (7 papers)

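| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning | Proposes a fine-tuning-free, attention-driven GUI grounding method that uses pretrained multimodal large language models to localize GUI components precisely. | large language model, multimodal | |
| 2 | OmniHD-Scenes: A Next-Generation Multimodal Dataset for Autonomous Driving | OmniHD-Scenes: a next-generation multimodal dataset for autonomous driving that supports low-cost sensor solutions. | multimodal | |
| 3 | MEATRD: Multimodal Anomalous Tissue Region Detection Enhanced with Spatial Transcriptomics | MEATRD: multimodal anomalous tissue region detection enhanced with spatial transcriptomics. | multimodal | |
| 4 | Low-Biased General Annotated Dataset Generation | Proposes the lbGen framework, which improves generalization on downstream vision tasks by generating a low-biased general-purpose dataset. | foundation model, multimodal | |
| 5 | Optimizing Vision-Language Interactions Through Decoder-Only Models | Proposes MUDAIF, a decoder-only vision-language model that improves efficiency and cross-modal understanding. | multimodal | |
| 6 | Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives | Proposes the CTRM module to enhance LVLMs by modeling causal and temporal relations in video narratives, improving video description quality. | multimodal | |
| 7 | Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm | Proposes the MI Grounding framework, based on an image prompt paradigm, for open-set object detection and segmentation. | large language model | |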

🔬 Pillar 3: Perception & Semantics (3 papers)

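| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 8 | DCSEG: Decoupled 3D Open-Set Segmentation using Gaussian Splatting | DCSEG: a decoupled 3D open-set segmentation method using Gaussian splatting. | gaussian splatting, splatting, NeRF | |
| 9 | CFSSeg: Closed-Form Solution for Class-Incremental Semantic Segmentation of 2D Images and 3D Point Clouds | Proposes CFSSeg, which uses a closed-form solution for efficient class-incremental semantic segmentation of 2D images and 3D point clouds. | scene understanding | |
| 10 | MAL: Cluster-Masked and Multi-Task Pretraining for Enhanced xLSTM Vision Performance | Proposes the MAL framework, which enhances xLSTM vision performance through cluster masking and multi-task pretraining. | depth estimation | |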

🔬 Pillar 2: RL & Architecture (3 papers)

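| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 11 | Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation | Proposes a rebalanced vision-language retrieval method with structure-aware distillation to address modality imbalance. | distillation, geometric consistency | |
| 12 | Video Representation Learning with Joint-Embedding Predictive Architectures | Proposes VJ-VCR, a self-supervised video representation learning method based on joint-embedding predictive architectures that improves understanding of video dynamics. | representation learning | |
| 13 | MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt | MambaPro: multi-modal object re-identification using Mamba aggregation and synergistic prompts. | Mamba | |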

🔬 Pillar 4: Generative Motion (1 paper)

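| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 14 | SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer | SoftVQ-VAE: an efficient 1-dimensional continuous image tokenizer that significantly speeds up generative model inference. | VQ-VAE | |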

🔬 Pillar 6: Video Extraction (1 paper)

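| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 15 | Detecting Activities of Daily Living in Egocentric Video to Contextualize Hand Use at Home in Outpatient Neurorehabilitation Settings | Proposes an object-interaction-based activity recognition method for understanding patients' hand use at home in neurorehabilitation. | egocentric | |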

🔬 Pillar 7: Motion Retargeting (1 paper)

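| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | Enhancing Road Crack Detection Accuracy with BsS-YOLO: Optimizing Feature Fusion and Attention Mechanisms | BsS-YOLO improves road crack detection accuracy by optimizing feature fusion and attention mechanisms. | structure preservation | |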
