cs.CV (2025-01-16)

📊 26 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗4) · Pillar 2: RL & Architecture (8 🔗1) · Pillar 3: Perception & Semantics (7 🔗2) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1 🔗1)

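🔬 Pillar 9: Embodied Foundation Models (9 papers)

# | Title | One-line summary | Tags | 🔗
1 | A Simple Aerial Detection Baseline of Multimodal Language Models | Proposes LMMRotate, the first exploration of multimodal language models for object detection in remote sensing imagery | foundation model, multimodal, visual grounding |
2 | Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis | Omni-Emotion: extends video MLLMs with fine-grained face and audio modeling for multimodal emotion analysis | large language model, multimodal |
3 | AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring | AugRefer: improves 3D visual grounding through cross-modal augmentation and spatial relation guidance | foundation model, visual grounding |
4 | Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis | Proposes Surg-FTDA for few-shot surgical workflow analysis, reducing data dependence | foundation model |
5 | Scaling up self-supervised learning for improved surgical foundation models | Proposes SurgeNetXL, which significantly improves surgical computer vision foundation models through large-scale self-supervised learning | foundation model |
6 | SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation | SMPLest-X: expressive human pose and shape estimation through extreme scaling | foundation model |
7 | Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues | Proposes a sign language translation framework that incorporates contextual cues to improve translation accuracy | large language model |
8 | CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models | Proposes the CHIRP benchmark for fine-grained evaluation of open-ended response generation in vision-language models | large language model |
9 | ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers | Proposes ASCENT-ViT, which enhances ViT interpretability through attention mechanisms and scale-aware concept learning | foundation model |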

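🔬 Pillar 2: RL & Architecture (8 papers)

# | Title | One-line summary | Tags | 🔗
10 | WMamba: Wavelet-based Mamba for Face Forgery Detection | WMamba: a wavelet-based Mamba architecture for face forgery detection | Mamba, spatial relationship |
11 | VideoWorld: Exploring Knowledge Learning from Unlabeled Videos | VideoWorld: a deep generative model that explores learning knowledge from unlabeled videos | reinforcement learning, latent dynamics, large language model |
12 | Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key | OPA-DPO: mitigates hallucinations in large vision-language models through on-policy data alignment | DPO, direct preference optimization |
13 | The Devil is in the Details: Simple Remedies for Image-to-LiDAR Representation Learning | Improves image-to-LiDAR representation learning by optimizing the coordinate system, quantization, and data utilization | representation learning, distillation |
14 | Strategic Base Representation Learning via Feature Augmentations for Few-Shot Class Incremental Learning | Proposes a contrastive learning framework based on feature augmentation to address class discrimination in few-shot class-incremental learning | representation learning, contrastive learning |
15 | Soft Knowledge Distillation with Multi-Dimensional Cross-Net Attention for Image Restoration Models Compression | Proposes a soft knowledge distillation method based on multi-dimensional cross-attention for compressing image restoration models | contrastive learning, distillation |
16 | Knowledge Distillation for Image Restoration : Simultaneous Learning from Degraded and Clean Images | Proposes the SLKD framework, which compresses image restoration models via dual-teacher knowledge distillation, significantly reducing computation | DRL, distillation |
17 | Towards Robust and Realistic Human Pose Estimation via WiFi Signals | Proposes the DT-Pose framework to address cross-domain and structural fidelity problems in WiFi-based human pose estimation | representation learning, contrastive learning |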

🔬 Pillar 3: Perception & Semantics (7 papers)

# | Title | One-line summary | Tags | 🔗
18 | Creating Virtual Environments with 3D Gaussian Splatting: A Comparative Study | A comparative study of methods for creating virtual environments with 3D Gaussian Splatting | 3D gaussian splatting, 3DGS, gaussian splatting |
19 | DEFOM-Stereo: Depth Foundation Model Based Stereo Matching | DEFOM-Stereo: a stereo matching method based on a depth foundation model that improves zero-shot generalization | depth estimation, monocular depth, metric depth |
20 | Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites | Evaluates the suitability of open-vocabulary models for detecting MEP elements on construction sites | open-vocabulary, open vocabulary |
21 | Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation | PartCATSeg: open-vocabulary part segmentation via cost aggregation | open-vocabulary, open vocabulary |
22 | VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization | VanGogh: a unified multimodal diffusion framework for video colorization | optical flow, multimodal |
23 | Normal-NeRF: Ambiguity-Robust Normal Estimation for Highly Reflective Scenes | Normal-NeRF: proposes a robust normal estimation method to address NeRF reconstruction of highly reflective scenes | NeRF, neural radiance field |
24 | OpticFusion: Multi-Modal Neural Implicit 3D Reconstruction of Microstructures by Fusing White Light Interferometry and Optical Microscopy | OpticFusion: multimodal neural implicit 3D reconstruction of microstructures by fusing white light interferometry and optical microscopy | implicit representation |

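🔬 Pillar 1: Robot Control (1 paper)

# | Title | One-line summary | Tags | 🔗
25 | Distilling Multi-modal Large Language Models for Autonomous Driving | DiMA: improves the efficiency and safety of end-to-end autonomous driving systems through knowledge distillation | motion planning, large language model |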

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-line summary | Tags | 🔗
26 | SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces | SynthLight: portrait relighting with a diffusion model by learning to re-render synthetic faces | classifier-free guidance |
