cs.CV(2025-01-12)
📊 15 papers in total | 🔗 4 with code
🎯 Navigation by Interest Area
Pillar 3: Spatial Perception & Semantics (5 🔗1)
Pillar 9: Embodied Foundation Models (4 🔗2)
Pillar 2: RL Algorithms & Architecture (3 🔗1)
Pillar 6: Video Extraction & Matching (1)
Pillar 1: Robot Control (1)
Pillar 8: Physics-based Animation (1)
🔬 Pillar 3: Spatial Perception & Semantics (5 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Semantic-CD: Remote Sensing Image Semantic Change Detection towards Open-vocabulary Setting | Proposes Semantic-CD, which uses CLIP to improve the generalization of semantic change detection in remote sensing images. | open-vocabulary open vocabulary foundation model | | |
| 2 | Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution | Proposes GSASR, which uses generalized and efficient 2D Gaussian splatting for arbitrary-scale super-resolution. | gaussian splatting splatting | ✅ | |
| 3 | F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Aggregative Gaussian Splatting | Proposes F3D-Gaus, which uses cycle-aggregative Gaussian splatting for generalizable 3D-aware generation on ImageNet. | gaussian splatting splatting | | |
| 4 | ActiveGAMER: Active GAussian Mapping through Efficient Rendering | ActiveGAMER: active Gaussian mapping through efficient rendering, for real-time scene exploration and reconstruction. | 3D gaussian splatting 3DGS gaussian splatting | | |
| 5 | Synthetic Prior for Few-Shot Drivable Head Avatar Inversion | SynShot: a few-shot drivable head avatar inversion method based on a synthetic prior. | 3D gaussian splatting gaussian splatting splatting | | |
🔬 Pillar 9: Embodied Foundation Models (4 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 6 | GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing | GeoPix: a multi-modal large language model for pixel-level understanding of remote sensing images. | large language model visual grounding | | |
| 7 | RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models | Proposes RSRefSeg, which uses foundation models for referring segmentation of remote sensing images. | foundation model multimodal | ✅ | |
| 8 | Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints | Proposes the C³VG model, which uses coarse-to-fine consistency constraints to address the inconsistency problem in multi-task visual grounding and segmentation. | multimodal visual grounding | ✅ | |
| 9 | SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval | SCOT: a self-supervised contrastive pretraining method for zero-shot compositional retrieval. | large language model multimodal | | |
🔬 Pillar 2: RL Algorithms & Architecture (3 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 10 | Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving | Proposes a knowledge-distillation-based vision-language model to improve pedestrian behavior understanding and scene perception in autonomous driving. | distillation scene understanding open-vocabulary | | |
| 11 | VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning | Proposes VidChain to address fine-grained temporal understanding in dense video captioning. | DPO direct preference optimization large language model | ✅ | |
| 12 | Mamba-MOC: A Multicategory Remote Object Counting via State Space Model | Proposes Mamba-MOC, which uses a state space model for multi-category object counting in remote sensing. | Mamba state space model | | |
🔬 Pillar 6: Video Extraction & Matching (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding | Proposes X-LeBench for evaluating understanding of extremely long egocentric videos, filling a gap in existing benchmarks. | egocentric Ego4D large language model | | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation | Vid2Sim: generates realistic, interactive simulation environments from video to improve urban navigation performance. | sim-to-real sim2real real2sim | | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | Temporal-Aware Spiking Transformer Hashing Based on 3D-DWT | Proposes Spikinghash, a temporal-aware spiking Transformer hashing method based on 3D-DWT, for efficient retrieval of dynamic visual data. | spatiotemporal | | |