cs.CV (2024-11-04)

📊 27 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗3) · Pillar 3: Perception & Semantics (9 🔗3) · Pillar 2: RL & Architecture (5 🔗2) · Pillar 6: Video Extraction (2 🔗1) · Pillar 1: Robot Control (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

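| # | Title | Summary | Tags |
|---|---|---|---|
| 1 | ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model | ChatTracker: improves visual tracking performance using a multimodal large language model. | large language model, multimodal |
| 2 | KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension | Proposes KptLLM, which uses a large language model for semantic keypoint comprehension, addressing the difficulty of capturing pixel-level semantic detail. | large language model, multimodal, chain-of-thought |
| 3 | Digi2Real: Bridging the Realism Gap in Synthetic Data Face Recognition via Foundation Models | Digi2Real: uses face foundation models to bridge the realism gap in face recognition trained on synthetic data. | foundation model |
| 4 | A Novel Deep Learning Tractography Fiber Clustering Framework for Functionally Consistent White Matter Parcellation Using Multimodal Diffusion MRI and Functional MRI | Proposes the Deep Multi-view Fiber Clustering (DMVFC) framework for functionally consistent white matter parcellation. | multimodal |
| 5 | 3D Audio-Visual Segmentation | Proposes EchoSegnet for sound-based object segmentation in 3D scenes. | embodied AI, foundation model |
| 6 | Multi-Transmotion: Pre-trained Model for Human Motion Prediction | Multi-Transmotion: a cross-modal pre-trained model for human motion prediction. | multimodal |
| 7 | Adaptive Length Image Tokenization via Recurrent Allocation | Proposes an adaptive-length image tokenization method based on recurrent allocation, improving the representation efficiency of vision systems. | large language model |
| 8 | AM Flow: Adapters for Temporal Processing in Action Recognition | Proposes AM Flow and temporal processing adapters to improve the temporal modeling of image models in action recognition. | foundation model |
| 9 | SPECTRUM: Semantic Processing and Emotion-informed video-Captioning Through Retrieval and Understanding Modalities | SPECTRUM: a video captioning framework that combines semantic processing with emotion information. | multimodal |
| 10 | Learning Where to Edit Vision Transformers | Proposes a hypernetwork-based ViT editing method that improves generalization and locality under subpopulation shift. | large language model |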

🔬 Pillar 3: Perception & Semantics (9 papers)

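| # | Title | Summary | Tags |
|---|---|---|---|
| 11 | FewViewGS: Gaussian Splatting with Few View Matching and Multi-stage Training | FewViewGS: Gaussian splatting with few-view matching and multi-stage training, improving novel view synthesis from sparse images. | depth estimation, 3D gaussian splatting, gaussian splatting |
| 12 | GVKF: Gaussian Voxel Kernel Functions for Highly Efficient Surface Reconstruction in Open Scenes | Proposes Gaussian voxel kernel functions for efficient 3D surface reconstruction in open scenes. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 13 | Improving Domain Generalization in Self-supervised Monocular Depth Estimation via Stabilized Adversarial Training | Proposes the SCAT framework, which improves the domain generalization of self-supervised monocular depth estimation through stabilized adversarial training. | depth estimation, monocular depth |
| 14 | Exploiting Unlabeled Data with Multiple Expert Teachers for Open Vocabulary Aerial Object Detection and Its Orientation Adaptation | Proposes CastDet, addressing weak features and arbitrary orientations in open-vocabulary aerial object detection. | open-vocabulary, open vocabulary |
| 15 | PMPNet: Pixel Movement Prediction Network for Monocular Depth Estimation in Dynamic Scenes | PMPNet: a pixel movement prediction network for monocular depth estimation in dynamic scenes. | depth estimation, monocular depth |
| 16 | A Probabilistic Formulation of LiDAR Mapping with Neural Radiance Fields | Proposes a probabilistic NeRF-based LiDAR mapping method that addresses phantom surfaces caused by multiple reflections. | NeRF, neural radiance field, PULSE |
| 17 | Map++: Towards User-Participatory Visual SLAM Systems with Efficient Map Expansion and Sharing | Map++: a user-participatory visual SLAM system with efficient map expansion and sharing. | visual SLAM |
| 18 | Communicate Less, Synthesize the Rest: Latency-aware Intent-based Generative Semantic Multicasting with Diffusion Models | Proposes a latency-aware, intent-based generative semantic multicasting framework that uses diffusion models to reduce communication volume. | semantic map |
| 19 | Multi-task Geometric Estimation of Depth and Surface Normal from Monocular 360° Images | Proposes a multi-task learning network for estimating depth and surface normals from monocular 360° images. | scene understanding |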

🔬 Pillar 2: RL & Architecture (5 papers)

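| # | Title | Summary | Tags |
|---|---|---|---|
| 20 | Learning General-Purpose Biomedical Volume Representations using Randomized Synthesis | Proposes a randomized-synthesis approach to learning general-purpose biomedical volume representations, improving model generalization. | representation learning, contrastive learning, foundation model |
| 21 | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | PPLLaVA: proposes a prompt-guided pooling strategy for unified understanding of short and long videos. | DPO, direct preference optimization, large language model |
| 22 | Masked Autoencoders are Parameter-Efficient Federated Continual Learners | Proposes pMAE, a parameter-efficient federated continual learning method that addresses catastrophic forgetting and non-IID data. | masked autoencoder, MAE |
| 23 | How Far is Video Generation from World Model: A Physical Law Perspective | Evaluates the world-model capability and generalization mechanisms of video generation models from a physical-law perspective. | world model |
| 24 | Rotation Perturbation Robustness in Point Cloud Analysis: A Perspective of Manifold Distillation | Proposes a manifold-distillation-based method for robustness to rotation perturbation in point clouds. | distillation |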

🔬 Pillar 6: Video Extraction (2 papers)

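| # | Title | Summary | Tags |
|---|---|---|---|
| 25 | TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos | TI-PREGO: uses chain of thought and in-context learning for online mistake detection in procedural egocentric videos. | egocentric, large language model, chain-of-thought |
| 26 | Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack | Proposes a semantic-aligned adversarial evolution triangle method that improves the transferability of adversarial examples against vision-language models. | feature matching, multimodal |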

🔬 Pillar 1: Robot Control (1 paper)

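| # | Title | Summary | Tags |
|---|---|---|---|
| 27 | Training-free Regional Prompting for Diffusion Transformers | Proposes a training-free regional prompting method that improves the fine-grained control of Diffusion Transformers when generating from complex text prompts. | manipulation, spatial relationship, large language model |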
