cs.CV(2025-12-13)

📊 19 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 3: Spatial Perception & Semantics (6) · Pillar 9: Embodied Foundation Models (6) · Pillar 2: RL Algorithms & Architecture (2) · Pillar 8: Physics-based Animation (2, 🔗 1) · Pillar 6: Video Extraction & Matching (1) · Pillar 7: Motion Retargeting (1, 🔗 1) · Pillar 4: Generative Motion (1)

🔬 Pillar 3: Spatial Perception & Semantics (6 papers)

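| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation | Proposes BokehDepth, which uses defocus as an auxiliary geometric cue to improve the accuracy and robustness of monocular depth estimation. | depth estimation, monocular depth, metric depth | |
| 2 | SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation | SMRABooth: customized video generation through subject and motion representation alignment. | optical flow, motion representation | |
| 3 | WeDetect: Fast Open-Vocabulary Object Detection as Retrieval | WeDetect: proposes a fast retrieval-based framework for open-vocabulary object detection, enabling efficient and versatile detection. | open-vocabulary, open vocabulary | |
| 4 | MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding | Proposes MRD, which uses differentiable rendering to probe how well vision models understand 3D scenes. | implicit representation, scene understanding | |
| 5 | A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection | Releases a multi-year image dataset of urban street lighting for visual monitoring and spatio-temporal drift detection. | scene understanding, TAMP | |
| 6 | Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video | Proposes an audio-visual camera pose estimation method that uses scene sounds to augment visual information, improving robustness on in-the-wild video. | scene understanding | |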

🔬 Pillar 9: Embodied Foundation Models (6 papers)

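| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 7 | EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography | EchoVLM: measurement-grounded multimodal learning for echocardiography. | foundation model, multimodal | |
| 8 | ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States | ArtGen: proposes a conditional generative model for articulated objects in arbitrary part-level states. | chain-of-thought | |
| 9 | VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding | VideoARM: agentic reasoning over hierarchical memory for long-form video understanding. | multimodal | |
| 10 | Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection | Cognitive-YOLO: LLM-driven synthesis of object detection architectures from first principles of data. | large language model | |
| 11 | Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking | Proposes a visual faithfulness evaluation framework and a self-reflection method to improve the reasoning reliability of vision-language models. | multimodal | |
| 12 | AutoMV: An Automatic Multi-Agent System for Music Video Generation | AutoMV: a multi-agent system for automatic music video generation. | multimodal | |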

🔬 Pillar 2: RL Algorithms & Architecture (2 papers)

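| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models | PeRL-VL: improves the multimodal reasoning of vision-language models by decoupling perception from reasoning. | reinforcement learning, distillation, multimodal | |
| 14 | Moment and Highlight Detection via MLLM Frame Segmentation | Proposes a video moment and highlight detection method based on MLLM frame segmentation. | reinforcement learning, multimodal, TAMP | |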

🔬 Pillar 8: Physics-based Animation (2 papers)

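| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 15 | MeltwaterBench: Deep learning for spatiotemporal downscaling of surface meltwater | Proposes MeltwaterBench, which uses deep learning for spatiotemporal downscaling of glacier surface meltwater. | spatiotemporal | |
| 16 | ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB | Proposes ISA-ViT and the ALERT dataset for driver activity recognition with IR-UWB radar. | PULSE | |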

🔬 Pillar 6: Video Extraction & Matching (1 paper)

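| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction | M4Human: a large-scale multimodal mmWave radar benchmark dataset for human mesh reconstruction. | HMR, multimodal | |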

🔬 Pillar 7: Motion Retargeting (1 paper)

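| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 18 | Endless World: Real-Time 3D-Aware Long Video Generation | Proposes Endless World, which generates 3D-consistent, infinitely long video in real time. | geometric consistency | |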

🔬 Pillar 4: Generative Motion (1 paper)

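| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 19 | Speedrunning ImageNet Diffusion | Proposes SR-DiT, which combines multiple optimization strategies to accelerate ImageNet diffusion model training, significantly improving efficiency. | classifier-free guidance | |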
