cs.CV (2025-03-30)
📊 18 papers in total | 🔗 4 with code
🎯 Interest Area Navigation
Pillar 3: Spatial Perception and Semantics (5)
Pillar 2: RL Algorithms and Architecture (5 🔗2)
Pillar 9: Embodied Foundation Models (4 🔗1)
Pillar 8: Physics-based Animation (2)
Pillar 4: Generative Motion (1 🔗1)
Pillar 6: Video Extraction and Matching (1)
🔬 Pillar 3: Spatial Perception and Semantics (5 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
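| 1 | ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning | ReasonGrounder: LVLM-guided hierarchical feature splatting for open-vocabulary 3D visual grounding and reasoning | 3D gaussian splatting, gaussian splatting, splatting | | |
| 2 | Enhancing 3D Gaussian Splatting Compression via Spatial Condition-based Prediction | Proposes a 3D Gaussian Splatting compression method based on spatial condition prediction that significantly reduces storage and transmission costs | 3D gaussian splatting, 3DGS, gaussian splatting | | |
| 3 | PhysPose: Refining 6D Object Poses with Physical Constraints | PhysPose: refines 6D object pose estimation with physical constraints, improving performance in real-world scenes | scene reconstruction, scene understanding, penetration | | |
| 4 | Blurry-Edges: Photon-Limited Depth Estimation from Defocused Boundaries | Proposes a deep learning method based on a blurry-edge representation for depth estimation from photon-limited images | depth estimation | | |
| 5 | Multiview Image-Based Localization | Proposes a hybrid multiview image-based localization method that improves localization accuracy, efficiency, and memory usage | scene reconstruction | | |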
🔬 Pillar 2: RL Algorithms and Architecture (5 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
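| 6 | Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model | DFI-OmniStereo: uses a pre-trained depth model to improve omnidirectional stereo matching accuracy | MAE, depth estimation, monocular depth | | |
| 7 | BoundMatch: Boundary detection applied to semi-supervised segmentation | BoundMatch: a semi-supervised semantic segmentation framework that incorporates boundary detection to improve segmentation accuracy | teacher-student, foundation model | | |
| 8 | Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning | Studies how image augmentations affect CLIP's representations, shedding light on the inner mechanisms of representation learning in vision-language models | representation learning | ✅ | |
| 9 | Reinforcement Learning-based Token Pruning in Vision Transformers: A Markov Game Approach | Proposes a reinforcement learning-based token pruning method for ViTs that speeds up inference | reinforcement learning | ✅ | |
| 10 | ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models | ViT-Linearizer: uses knowledge distillation to convert quadratic-complexity ViT models into linear-complexity vision models | Mamba, distillation | | |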
🔬 Pillar 9: Embodied Foundation Models (4 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | OpenDriveVLA: end-to-end autonomous driving based on a large vision-language-action model | vision-language-action, large language model, multimodal | | |
| 12 | EagleVision: Object-level Attribute Multimodal LLM for Remote Sensing | Proposes EagleVision, a multimodal large language model for object-level attribute understanding in remote sensing imagery | large language model, multimodal | ✅ | |
| 13 | Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging | Uses vision-language foundation models to reveal hidden attribute relationships in medical imaging | foundation model | | |
| 14 | KernelDNA: Dynamic Kernel Sharing via Decoupled Naive Adapters | KernelDNA: dynamic convolution kernel sharing via decoupled naive adapters, improving efficiency | large language model | | |
🔬 Pillar 8: Physics-based Animation (2 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
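| 15 | MoCha: Towards Movie-Grade Talking Character Synthesis | MoCha: movie-grade talking character synthesis, producing realistic, controllable full-body character animation | character animation | | |
| 16 | OwlSight: A Robust Illumination Adaptation Framework for Dark Video Human Action Recognition | OwlSight: a robust illumination adaptation framework for human action recognition in dark videos | spatiotemporal | | |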
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior | VLIPP: physically plausible video generation using a vision- and language-informed physical prior | physically plausible, chain-of-thought | ✅ | |
🔬 Pillar 6: Video Extraction and Matching (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 18 | Learning Predictive Visuomotor Coordination | Proposes a prediction-based Visuomotor Coordination Representation (VCR) for predicting head pose, gaze, and upper-body motion | egocentric, egocentric vision, multimodal | | |