cs.CV (2025-01-14)

📊 24 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) | Pillar 3: Perception & Semantics (7 🔗1) | Pillar 2: RL & Architecture (5) | Pillar 6: Video Extraction (2 🔗2) | Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding	LLaVA-ST: a multimodal large language model for fine-grained spatial-temporal understanding	large language model multimodal	✅
2	Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models	Proposes Moment-GPT, which uses frozen multimodal large language models for zero-shot video moment retrieval.	large language model multimodal
3	Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding	Proposes Parameter-Inverted Image Pyramid Networks (PIIP), improving visual perception and multimodal understanding at low computational cost.	large language model foundation model multimodal	✅
4	Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers	FUTURIST: proposes a semantic future prediction method based on a multimodal visual sequence Transformer	multimodal
5	Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features	Builds a multimodal image analysis benchmark to evaluate how well models understand fine-grained visual features	multimodal
6	Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving	Uses vision foundation models for anomaly detection in autonomous-driving input monitoring	foundation model
7	Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness	Proposes FaceTrack-MM and FEC-Bench to improve video MLLMs' dynamic facial expression perception and contextual understanding	large language model multimodal instruction following
8	Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks	Omni-RGPT: unifies image and video region-level understanding via Token Mark	large language model multimodal
9	Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models	Vchitect-2.0: a parallel Transformer architecture that scales up video diffusion models for large-scale text-to-video generation.	multimodal

🔬 Pillar 3: Perception & Semantics (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
10	DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models	DAViD: models the dynamic affordance of 3D objects using pre-trained video diffusion models	affordance motion diffusion model MDM
11	A Critical Synthesis of Uncertainty Quantification and Foundation Models in Monocular Depth Estimation	Combines uncertainty quantification with depth foundation models to improve the reliability of monocular depth estimation	depth estimation monocular depth metric depth
12	3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding	Proposes 3UR-LLM, an end-to-end multimodal large language model for 3D scene understanding	scene understanding large language model multimodal
13	Revisiting Birds Eye View Perception Models with Frozen Foundation Models: DINOv2 and Metric3Dv2	Uses frozen DINOv2 and Metric3Dv2 to improve the performance of bird's-eye-view perception models	depth estimation Metric3D foundation model
14	Object-Centric 2D Gaussian Splatting: Background Removal and Occlusion-Aware Pruning for Compact Object Models	Proposes object-centric 2D Gaussian splatting, achieving compact object models through background removal and occlusion-aware pruning.	gaussian splatting splatting
15	Automotive Elevation Mapping with Interferometric Synthetic Aperture Radar	Uses interferometric synthetic aperture radar for accurate elevation mapping from vehicles in urban and agricultural environments	elevation map
16	Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise	Proposes motion-controllable video diffusion models based on real-time noise warping, enabling flexible control over video generation.	optical flow	✅

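🔬 Pillar 2: RL & Architecture (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
17	FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing	FLAVARS: a multimodal foundational language-vision alignment model for remote sensing that balances vision-task performance with zero-shot capability.	MAE contrastive learning multimodal
18	DH-Mamba: Exploring Dual-domain Hierarchical State Space Models for MRI Reconstruction	Proposes DH-Mamba, which uses dual-domain hierarchical state space models for efficient MRI reconstruction.	Mamba state space model
19	AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation	AVS-Mamba: explores temporal and multi-modal Mamba models for audio-visual segmentation	Mamba state space model
20	AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation	Proposes AgentPose, which uses a feature agent for progressive distribution alignment to improve distillation for human pose estimation.	distillation
21	Balance Divergence for Knowledge Distillation	Proposes balance divergence distillation to address the under-utilization of negative knowledge in knowledge distillation.	distillation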

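🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos	BioPose: proposes a framework for biomechanically accurate 3D pose estimation from monocular videos	human mesh recovery HMR SMPL	✅
23	Predicting 4D Hand Trajectory from Monocular Videos	Proposes HaPTIC, which predicts coherent 4D hand trajectories from monocular videos and improves global trajectory accuracy.	egocentric	✅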

🔬 Pillar 1: Robot Control (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
24	LayerAnimate: Layer-level Control for Animation	LayerAnimate: proposes a video diffusion framework with layer-level control for animation creation.	manipulation
