cs.CV (2025-12-13)
📊 19 papers in total | 🔗 2 with code
🎯 Interest Area Navigation
Pillar 3: Spatial Perception & Semantics (6)
Pillar 9: Embodied Foundation Models (6)
Pillar 2: RL Algorithms & Architecture (2)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 6: Video Extraction & Matching (1)
Pillar 7: Motion Retargeting (1 🔗1)
Pillar 4: Generative Motion (1)
🔬 Pillar 3: Spatial Perception & Semantics (6 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation | Proposes BokehDepth, which uses defocus as an auxiliary geometric cue to improve the accuracy and robustness of monocular depth estimation. | depth estimation, monocular depth, metric depth | | |
| 2 | SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation | SMRABooth: aligns subject and motion representations to enable customized video generation. | optical flow, motion representation | | |
| 3 | WeDetect: Fast Open-Vocabulary Object Detection as Retrieval | WeDetect: a fast open-vocabulary object detection framework cast as retrieval, enabling efficient and versatile detection. | open-vocabulary, open vocabulary | | |
| 4 | MRD: Using Physically Based Differentiable Rendering to Probe Vision Models for 3D Scene Understanding | Proposes MRD, which uses differentiable rendering to probe how well vision models understand 3D scenes. | implicit representation, scene understanding | | |
| 5 | A Multi-Year Urban Streetlight Imagery Dataset for Visual Monitoring and Spatio-Temporal Drift Detection | Releases a multi-year image dataset of urban streetlights for visual monitoring and spatio-temporal drift detection. | scene understanding, TAMP | | |
| 6 | Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video | Proposes an audio-visual camera pose estimation method that uses scene sounds to supplement visual information, improving robustness on in-the-wild video. | scene understanding | | |
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography | EchoVLM: measurement-grounded multimodal learning for echocardiography. | foundation model, multimodal | | |
| 8 | ArtGen: Conditional Generative Modeling of Articulated Objects in Arbitrary Part-Level States | ArtGen: a conditional generative model that produces articulated objects in arbitrary part-level states. | chain-of-thought | | |
| 9 | VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding | VideoARM: agentic reasoning over hierarchical memory for long-form video understanding. | multimodal | | |
| 10 | Cognitive-YOLO: LLM-Driven Architecture Synthesis from First Principles of Data for Object Detection | Cognitive-YOLO: LLM-driven synthesis of object detection architectures from first principles of the data. | large language model | | |
| 11 | Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking | Proposes a visual faithfulness evaluation framework and a self-reflection method to improve the reasoning reliability of vision-language models. | multimodal | | |
| 12 | AutoMV: An Automatic Multi-Agent System for Music Video Generation | AutoMV: a multi-agent system for automatic music video generation. | multimodal | | |
🔬 Pillar 2: RL Algorithms & Architecture (2 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | More Than the Final Answer: Improving Visual Extraction and Logical Consistency in Vision-Language Models | PeRL-VL: improves multimodal reasoning in vision-language models by decoupling perception from reasoning. | reinforcement learning, distillation, multimodal | | |
| 14 | Moment and Highlight Detection via MLLM Frame Segmentation | Proposes a video moment and highlight detection method based on MLLM frame segmentation. | reinforcement learning, multimodal, TAMP | | |
🔬 Pillar 8: Physics-based Animation (2 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | MeltwaterBench: Deep learning for spatiotemporal downscaling of surface meltwater | Proposes MeltwaterBench, which uses deep learning for spatiotemporal downscaling of glacial surface meltwater. | spatiotemporal | ✅ | |
| 16 | ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB | Proposes ISA-ViT and the ALERT dataset for driver activity recognition with IR-UWB radar. | PULSE | | |
🔬 Pillar 6: Video Extraction & Matching (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | M4Human: A Large-Scale Multimodal mmWave Radar Benchmark for Human Mesh Reconstruction | M4Human: a large-scale multimodal mmWave radar benchmark dataset for human mesh reconstruction. | HMR, multimodal | | |
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 18 | Endless World: Real-Time 3D-Aware Long Video Generation | Proposes Endless World, which generates 3D-consistent video of unbounded length in real time. | geometric consistency | ✅ | |
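🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | Speedrunning ImageNet Diffusion | Proposes SR-DiT, which combines multiple optimization strategies to speed up ImageNet diffusion model training and significantly improve efficiency. | classifier-free guidance | | |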