cs.CV(2025-12-12)

📊 39 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 3: Perception & Semantics (12 🔗3)
Pillar 9: Embodied Foundation Models (9 🔗2)
Pillar 2: RL & Architecture (9 🔗1)
Pillar 1: Robot Control (3)
Pillar 6: Video Extraction (2)
Pillar 4: Generative Motion (1 🔗1)
Pillar 5: Interaction & Reaction (1)
Pillar 8: Physics-based Animation (1)
Pillar 7: Motion Retargeting (1)

🔬 Pillar 3: Perception & Semantics (12 papers)

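| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Moment-Based 3D Gaussian Splatting: Resolving Volumetric Occlusion with Order-Independent Transmittance | Proposes moment-based 3D Gaussian Splatting, resolving volumetric occlusion via order-independent transmittance. | 3D gaussian splatting 3DGS gaussian splatting | |
| 2 | Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video | Proposes a prior-enhanced Gaussian Splatting method for reconstructing dynamic scenes from casually captured video. | gaussian splatting splatting scene reconstruction | |
| 3 | Lightweight 3D Gaussian Splatting Compression via Video Codec | Proposes a lightweight 3D Gaussian Splatting compression method based on a video codec, improving compression performance at low bitrates. | 3D gaussian splatting gaussian splatting splatting | |
| 4 | VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing | Proposes VLM2GeoVec, a universal multimodal embedding for remote sensing that unifies retrieval and region understanding. | scene understanding multimodal instruction following | |
| 5 | Semantic-Drive: Democratizing Long-Tail Data Curation via Open-Vocabulary Grounding and Neuro-Symbolic VLM Consensus | Semantic-Drive: long-tail data mining via open-vocabulary grounding and neuro-symbolic VLM consensus. | open-vocabulary open vocabulary symbolic grounding | |
| 6 | MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction | Proposes the MultiEgo dataset for 4D scene reconstruction from multi-view egocentric video. | scene reconstruction egocentric | |
| 7 | Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection | Proposes Depth-Copy-Paste, which improves face detection robustness through multimodal, depth-aware compositing. | Depth Anything multimodal | |
| 8 | Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis | Proposes a benchmark that evaluates 3D understanding through multi-view correspondence analysis, with no fine-tuning required. | scene understanding foundation model | |
| 9 | On Geometric Understanding and Learned Data Priors in VGGT | Analyzes VGGT's geometric understanding, revealing its implicit geometric modeling and its reliance on data priors. | VGGT foundation model | |
| 10 | Reconstruction as a Bridge for Event-Based Visual Question Answering | Proposes an event-camera visual question answering method that uses reconstruction as a bridge, and builds the EvQA benchmark. | scene understanding large language model multimodal | |
| 11 | Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation | Proposes SAM2VideoX, which improves video generation quality with structure-preserving motion priors. | optical flow motion representation | |
| 12 | Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using LiDAR HD Reference Data across Metropolitan France | Proposes THREASURE-Net, which uses Sentinel-2 time series and LiDAR data for high-resolution forest canopy height mapping. | height map | |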

🔬 Pillar 9: Embodied Foundation Models (9 papers)

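| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 13 | SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection | SmokeBench: evaluates multimodal large language models on wildfire smoke detection. | large language model multimodal | |
| 14 | Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing | Proposes CLV-Net to address user-intent guidance in remote sensing image understanding. | large language model multimodal | |
| 15 | RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing | RoomPilot: controllable synthesis of interactive indoor environments via multimodal semantic parsing. | embodied AI multimodal | |
| 16 | RePack then Refine: Efficient Diffusion Transformer with Vision Foundation Model | Proposes the RePack then Refine framework, improving the training efficiency and generation quality of VFM-empowered diffusion Transformers. | foundation model | |
| 17 | HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning | Proposes the HFS framework for efficient video reasoning through holistic query-aware frame selection. | large language model multimodal chain-of-thought | |
| 18 | UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models | Proposes UFVideo, a video large language model for unified multi-granularity cooperative video understanding. | large language model | |
| 19 | 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation | 3DTeethSAM: uses SAM2 for 3D teeth segmentation, supporting dental digitization. | foundation model | |
| 20 | Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation | Proposes Skeleton-Cache, a training-free test-time adaptation framework for skeleton-based zero-shot action recognition. | large language model | |
| 21 | DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition | DynaPURLS: dynamically refines part-aware representations for skeleton-based zero-shot action recognition. | large language model | |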

🔬 Pillar 2: RL & Architecture (9 papers)

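| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 22 | VFMF: World Modeling by Forecasting Vision Foundation Model Features | VFMF: world modeling by forecasting vision foundation model features. | flow matching world model foundation model | |
| 23 | DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry | DentalGPT: advances dental automation by incentivizing multimodal complex reasoning. | reinforcement learning large language model multimodal | |
| 24 | Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation | Proposes the KineMIC framework, which achieves few-shot action synthesis via text-to-motion knowledge distillation to improve human action recognition. | distillation text-to-motion | |
| 25 | TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition | Proposes TSkel-Mamba, which uses a state space model for human skeleton-based action recognition and improves temporal modeling. | Mamba SSM state space model | |
| 26 | REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation | Proposes the REST framework for diffusion-based, real-time, end-to-end streaming talking head generation. | distillation spatiotemporal | |
| 27 | Physics-Informed Video Flare Synthesis and Removal Leveraging Motion Independence between Flare and Scene | Proposes a physics-informed video flare synthesis and removal method that addresses the motion independence between flare and scene. | Mamba optical flow spatiotemporal | |
| 28 | BAgger: Backwards Aggregation for Mitigating Drift in Autoregressive Video Diffusion Models | Proposes BAgger, which mitigates drift in autoregressive video diffusion models through backwards aggregation. | flow matching world model distillation | |
| 29 | Flowception: Temporally Expansive Flow Matching for Video Generation | Flowception: temporally expansive flow matching for variable-length video generation. | flow matching | |
| 30 | Robust MLLM Unlearning via Visual Knowledge Distillation | Proposes a robust MLLM unlearning method based on visual knowledge distillation, addressing the selective erasure of visual knowledge. | distillation | |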

🔬 Pillar 1: Robot Control (3 papers)

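| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 31 | FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model | FutureX: an enhancement for end-to-end autonomous driving based on a latent chain-of-thought world model. | motion planning world model chain-of-thought | |
| 32 | Embodied Image Compression | Proposes embodied image compression to address the communication bottleneck of embodied agents at low bitrates. | manipulation embodied AI vision-language-action | |
| 33 | V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties | V-RGBX: proposes the first intrinsic-aware video editing framework, enabling precise control and photorealistic synthesis. | manipulation physically plausible | |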

🔬 Pillar 6: Video Extraction (2 papers)

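| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 34 | The N-Body Problem: Parallel Execution from Single-Person Egocentric Video | Poses the N-Body Problem: reasoning about parallel multi-person activity from single-person egocentric video. | egocentric | |
| 35 | Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization | Proposes a multi-task learning framework with an extended temporal shift module for temporal action localization. | egocentric | |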

🔬 Pillar 4: Generative Motion (1 paper)

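| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 36 | KeyframeFace: From Text to Expressive Facial Keyframes | KeyframeFace: proposes a text-driven, interpretable framework for generating keyframe-based facial expression animation. | motion synthesis large language model multimodal | |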

🔬 Pillar 5: Interaction & Reaction (1 paper)

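| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 37 | CARI4D: Category Agnostic 4D Reconstruction of Human-Object Interaction | CARI4D: proposes a category-agnostic 4D reconstruction method for human-object interaction, tackling reconstruction from monocular RGB video. | human-object interaction foundation model | |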

🔬 Pillar 8: Physics-based Animation (1 paper)

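| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 38 | Exploring MLLM-Diffusion Information Transfer with MetaCanvas | MetaCanvas: uses an MLLM for spatial reasoning and planning within diffusion models, improving multimodal generation. | spatiotemporal large language model multimodal | |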

🔬 Pillar 7: Motion Retargeting (1 paper)

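| # | Title | Key point | Tags | 🔗 |
|---|---|---|---|---|
| 39 | CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop | CADMorph: proposes a geometry-driven parametric CAD editing framework for efficient iterative CAD model design. | structure preservation foundation model | |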
