cs.CV (2025-01-16)

📊 26 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗4) · Pillar 2: RL & Architecture (8 🔗1) · Pillar 3: Perception & Semantics (7 🔗2) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	A Simple Aerial Detection Baseline of Multimodal Language Models	Proposes LMMRotate, the first exploration of multimodal language models for object detection in remote sensing imagery	foundation model multimodal visual grounding	✅
2	Omni-Emotion: Extending Video MLLM with Detailed Face and Audio Modeling for Multimodal Emotion Analysis	Omni-Emotion: extends video MLLMs with fine-grained face and audio modeling for multimodal emotion analysis	large language model multimodal
3	AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring	AugRefer: improves 3D visual grounding through cross-modal augmentation and spatial relation-based referring	foundation model visual grounding
4	Text-driven Adaptation of Foundation Models for Few-shot Surgical Workflow Analysis	Proposes Surg-FTDA for few-shot surgical workflow analysis, reducing data dependence	foundation model	✅
5	Scaling up self-supervised learning for improved surgical foundation models	Proposes SurgeNetXL, which significantly improves foundation model performance in surgical computer vision through large-scale self-supervised learning	foundation model	✅
6	SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation	SMPLest-X: expressive human pose and shape estimation via ultimate scaling	foundation model	✅
7	Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues	Proposes a sign language translation framework that incorporates contextual cues to improve translation accuracy	large language model
8	CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models	Proposes the CHIRP benchmark for fine-grained evaluation of open-ended response generation in vision-language models	large language model
9	ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers	Proposes ASCENT-ViT, which improves ViT interpretability through attention-based, scale-aware concept learning	foundation model

🔬 Pillar 2: RL & Architecture (8 papers)

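#	Title	One-sentence summary	Tags	🔗	⭐
10	WMamba: Wavelet-based Mamba for Face Forgery Detection	WMamba: a wavelet-based Mamba architecture for face forgery detection	Mamba spatial relationship
11	VideoWorld: Exploring Knowledge Learning from Unlabeled Videos	VideoWorld: a deep generative model that explores learning knowledge from unlabeled videos	reinforcement learning latent dynamics large language model
12	Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key	OPA-DPO: mitigates hallucinations in large vision-language models through on-policy data alignment	DPO direct preference optimization	✅
13	The Devil is in the Details: Simple Remedies for Image-to-LiDAR Representation Learning	Improves image-to-LiDAR representation learning by optimizing the coordinate system, quantization, and data utilization	representation learning distillation
14	Strategic Base Representation Learning via Feature Augmentations for Few-Shot Class Incremental Learning	Proposes a contrastive learning framework based on feature augmentation to address class discrimination in few-shot class-incremental learning	representation learning contrastive learning
15	Soft Knowledge Distillation with Multi-Dimensional Cross-Net Attention for Image Restoration Models Compression	Proposes a compression method for image restoration models based on soft knowledge distillation with multi-dimensional cross-net attention	contrastive learning distillation
16	Knowledge Distillation for Image Restoration : Simultaneous Learning from Degraded and Clean Images	Proposes the SLKD framework, which compresses image restoration models through dual-teacher knowledge distillation and significantly reduces computation	DRL distillation
17	Towards Robust and Realistic Human Pose Estimation via WiFi Signals	Proposes the DT-Pose framework to address cross-domain and structural-fidelity issues in WiFi-based human pose estimation	representation learning contrastive learning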

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
18	Creating Virtual Environments with 3D Gaussian Splatting: A Comparative Study	A comparative study of methods for creating virtual environments with 3D Gaussian Splatting	3D gaussian splatting 3DGS gaussian splatting
19	DEFOM-Stereo: Depth Foundation Model Based Stereo Matching	DEFOM-Stereo: stereo matching based on a depth foundation model, improving zero-shot generalization	depth estimation monocular depth metric depth
20	Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites	Evaluates the suitability of open-vocabulary models for detecting MEP elements on construction sites	open-vocabulary open vocabulary
21	Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation	PartCATSeg: open-vocabulary part segmentation via cost aggregation	open-vocabulary open vocabulary
22	VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization	VanGogh: a unified multimodal diffusion framework for video colorization	optical flow multimodal	✅
23	Normal-NeRF: Ambiguity-Robust Normal Estimation for Highly Reflective Scenes	Normal-NeRF: proposes a robust normal estimation method to address NeRF reconstruction of highly reflective scenes	NeRF neural radiance field
24	OpticFusion: Multi-Modal Neural Implicit 3D Reconstruction of Microstructures by Fusing White Light Interferometry and Optical Microscopy	OpticFusion: multi-modal neural implicit 3D reconstruction of microstructures by fusing white light interferometry and optical microscopy	implicit representation	✅

🔬 Pillar 1: Robot Control (1 paper)

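#	Title	One-sentence summary	Tags	🔗	⭐
25	Distilling Multi-modal Large Language Models for Autonomous Driving	DiMA: improves the efficiency and safety of end-to-end autonomous driving systems through knowledge distillation	motion planning large language model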

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
26	SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces	SynthLight: portrait relighting with a diffusion model by learning to re-render synthetic faces	classifier-free guidance	✅
