cs.CV (2025-12-13)

📊 19 papers in total | 🔗 2 with code

🎯 Interest area navigation

Pillar 3: Spatial Perception & Semantics (6) · Pillar 9: Embodied Foundation Models (6) · Pillar 2: RL Algorithms & Architecture (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 6: Video Extraction & Matching (1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 4: Generative Motion (1)

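🔬 Pillar 3: Spatial Perception & Semantics (6 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation	Proposes BokehDepth, which uses defocus as an auxiliary geometric cue to improve the accuracy and robustness of monocular depth estimation.	depth estimation monocular depth metric depth
2	SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation	SMRABooth: customized video generation through subject and motion representation alignment.	optical flow motion representation
3	WeDetect: Fast Open-Vocabulary Object Detection as Retrieval	WeDetect: a fast retrieval-based framework for open-vocabulary object detection, enabling efficient and versatile detection.	open-vocabulary open vocabulary
4	MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding	Proposes MRD, which uses differentiable rendering to probe how well vision models understand 3D scenes.	implicit representation scene understanding
5	A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection	Releases a multi-year image dataset of urban streetlights for visual monitoring and spatio-temporal drift detection.	scene understanding TAMP
6	Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video	Proposes an audio-visual camera pose estimation method that uses scene sounds to complement visual information, improving robustness on in-the-wild video.	scene understanding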

🔬 Pillar 9: Embodied Foundation Models (6 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
7	EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography	EchoVLM: measurement-grounded multimodal learning for echocardiography.	foundation model multimodal
8	ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States	ArtGen: a conditional generative model for articulated objects in arbitrary part-level states.	chain-of-thought
9	VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding	VideoARM: agentic reasoning over hierarchical memory for long-form video understanding.	multimodal
10	Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection	Cognitive-YOLO: LLM-driven synthesis of object detection architectures from first principles of data.	large language model
11	Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking	Proposes a visual faithfulness evaluation framework and a self-reflection method to improve the reasoning reliability of vision-language models.	multimodal
12	AutoMV: An Automatic Multi-Agent System for Music Video Generation	AutoMV: a multi-agent system for automatic music video generation.	multimodal

🔬 Pillar 2: RL Algorithms & Architecture (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
13	More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models	PeRL-VL: improves multimodal reasoning in vision-language models by decoupling perception from reasoning.	reinforcement learning distillation multimodal
14	Moment and Highlight Detection via MLLM Frame Segmentation	Proposes a video moment and highlight detection method based on MLLM frame segmentation.	reinforcement learning multimodal TAMP

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
15	MeltwaterBench: Deep learning for spatiotemporal downscaling of surface meltwater	Proposes MeltwaterBench, which uses deep learning for spatiotemporal downscaling of glacier surface meltwater.	spatiotemporal	✅
16	ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB	Proposes ISA-ViT and the ALERT dataset for driver activity recognition with IR-UWB radar.	PULSE

🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
17	M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction	M4Human: a large-scale multimodal mmWave radar benchmark dataset for human mesh reconstruction.	HMR multimodal

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
18	Endless World: Real-Time 3D-Aware Long Video Generation	Proposes Endless World, which generates 3D-consistent video of unbounded length in real time.	geometric consistency	✅

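🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
19	Speedrunning ImageNet Diffusion	Proposes SR-DiT, which combines multiple optimization strategies to speed up ImageNet diffusion model training and substantially improve training efficiency.	classifier-free guidance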
