cs.CV (2025-01-14)

📊 24 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 3: Perception & Semantics (7 🔗1) · Pillar 2: RL & Architecture (5) · Pillar 6: Video Extraction (2 🔗2) · Pillar 1: Robot Control (1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

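# | Title | One-line summary | Tags | 🔗
1 | LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding | LLaVA-ST: a multimodal large language model for fine-grained spatial-temporal understanding. | large language model, multimodal |
2 | Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models | Proposes Moment-GPT, which uses frozen multimodal large language models for zero-shot video moment retrieval. | large language model, multimodal |
3 | Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding | Proposes Parameter-Inverted Image Pyramid Networks (PIIP) to improve visual perception and multimodal understanding at low computational cost. | large language model, foundation model, multimodal |
4 | Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers | FUTURIST: proposes a semantic future prediction method based on a multimodal visual sequence Transformer. | multimodal |
5 | Benchmarking Multimodal Models for Fine-Grained Image Analysis: A Comparative Study Across Diverse Visual Features | Builds a multimodal image analysis benchmark to evaluate how well models understand fine-grained visual features. | multimodal |
6 | Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving | Uses vision foundation models for anomaly detection in autonomous-driving input monitoring. | foundation model |
7 | Facial Dynamics in Video: Instruction Tuning for Improved Facial Expression Perception and Contextual Awareness | Proposes FaceTrack-MM and FEC-Bench to improve video MLLMs' perception of dynamic facial expressions and their contextual understanding. | large language model, multimodal, instruction following |
8 | Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks | Omni-RGPT: unifies region-level understanding of images and videos via Token Marks. | large language model, multimodal |
9 | Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models | Vchitect-2.0: a parallel Transformer architecture that scales up video diffusion models for large-scale text-to-video generation. | multimodal |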

🔬 Pillar 3: Perception & Semantics (7 papers)

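# | Title | One-line summary | Tags | 🔗
10 | DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models | DAViD: models the dynamic affordance of 3D objects using pre-trained video diffusion models. | affordance, motion diffusion model, MDM |
11 | A Critical Synthesis of Uncertainty Quantification and Foundation Models in Monocular Depth Estimation | Combines uncertainty quantification with depth foundation models to improve the reliability of monocular depth estimation. | depth estimation, monocular depth, metric depth |
12 | 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding | Proposes 3UR-LLM, an end-to-end multimodal large language model for 3D scene understanding. | scene understanding, large language model, multimodal |
13 | Revisiting Birds Eye View Perception Models with Frozen Foundation Models: DINOv2 and Metric3Dv2 | Uses frozen DINOv2 and Metric3Dv2 to improve the performance of bird's-eye-view perception models. | depth estimation, Metric3D, foundation model |
14 | Object-Centric 2D Gaussian Splatting: Background Removal and Occlusion-Aware Pruning for Compact Object Models | Proposes object-centric 2D Gaussian splatting, which yields compact object models through background removal and occlusion-aware pruning. | gaussian splatting, splatting |
15 | Automotive Elevation Mapping with Interferometric Synthetic Aperture Radar | Uses interferometric synthetic aperture radar for accurate automotive elevation mapping in urban and agricultural environments. | elevation map |
16 | Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise | Proposes motion-controllable video diffusion models based on real-time noise warping, enabling flexible control over video generation. | optical flow |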

🔬 Pillar 2: RL & Architecture (5 papers)

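# | Title | One-line summary | Tags | 🔗
17 | FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing | FLAVARS: a multimodal foundational language-vision alignment model for remote sensing that balances vision-task performance with zero-shot capability. | MAE, contrastive learning, multimodal |
18 | DH-Mamba: Exploring Dual-domain Hierarchical State Space Models for MRI Reconstruction | Proposes DH-Mamba, which uses dual-domain hierarchical state space models to reconstruct MRI images efficiently. | Mamba, state space model |
19 | AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation | AVS-Mamba: explores temporal and multimodal Mamba models for audio-visual segmentation. | Mamba, state space model |
20 | AgentPose: Progressive Distribution Alignment via Feature Agent for Human Pose Distillation | Proposes AgentPose, which aligns distributions progressively via a feature agent to improve distillation for human pose estimation. | distillation |
21 | Balance Divergence for Knowledge Distillation | Proposes balance divergence distillation to address the under-use of negative knowledge in knowledge distillation. | distillation |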

🔬 Pillar 6: Video Extraction (2 papers)

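# | Title | One-line summary | Tags | 🔗
22 | BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos | BioPose: proposes a framework for biomechanically accurate 3D pose estimation from monocular videos. | human mesh recovery, HMR, SMPL |
23 | Predicting 4D Hand Trajectory from Monocular Videos | Proposes HaPTIC, which predicts coherent 4D hand trajectories from monocular videos and improves global trajectory accuracy. | egocentric |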

🔬 Pillar 1: Robot Control (1 paper)

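# | Title | One-line summary | Tags | 🔗
24 | LayerAnimate: Layer-level Control for Animation | LayerAnimate: proposes a video diffusion framework with layer-level control to support animation creation. | manipulation |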
