cs.CV (2025-09-18)

📊 24 papers in total | 🔗 5 with code

🎯 Interest area navigation

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 2: RL & Architecture (7 🔗1) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 1: Robot Control (2) · Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (9 papers)

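| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | How Good are Foundation Models in Step-by-Step Embodied Reasoning? | Proposes the FoMER benchmark to evaluate the step-by-step reasoning ability of foundation models in embodied environments. | foundation model, multimodal | |
| 2 | Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding | Uses multimodal LLMs for zero-shot spatio-temporal video grounding and proposes the DSTH and TAS strategies. | large language model, multimodal | |
| 3 | From Pixels to Urban Policy-Intelligence: Recovering Legacy Effects of Redlining with a Multimodal LLM | Uses a multimodal LLM to go from pixels to urban policy intelligence, recovering the legacy effects of redlining. | large language model, multimodal | |
| 4 | Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation | Presents two web toolkits for acquiring multimodal piano performance datasets and annotating fingering. | multimodal | |
| 5 | V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety & Driver Behaviour Modelling | V-SenseDrive: a privacy-preserving framework that fuses road video with in-vehicle sensor data for road safety and driver behaviour modelling. | multimodal | |
| 6 | ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models | ORCA: uses agentic reasoning to improve how vision-language models perform on hallucination and adversarial robustness. | multimodal | |
| 7 | ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data | ScaleCUA: scales open-source computer use agents with cross-platform data. | foundation model | |
| 8 | QuizRank: Picking Images by Quizzing VLMs | QuizRank: ranks images by quizzing vision-language models, improving the quality of images chosen for Wikipedia articles. | large language model | |
| 9 | Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification | Proposes the CMGR framework, which achieves 3D few-shot class-incremental learning through cross-modal geometric rectification. | foundation model | |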

🔬 Pillar 2: RL & Architecture (7 papers)

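| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 10 | Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation | Proposes a cross-modal distillation method for monocular depth estimation with event cameras. | distillation, depth estimation, monocular depth | |
| 11 | Efficient Multimodal Dataset Distillation via Generative Models | Proposes EDGE, an efficient multimodal dataset distillation method based on generative models. | distillation, large language model, multimodal | |
| 12 | Comparing Computational Pathology Foundation Models using Representational Similarity Analysis | Compares multiple pretrained models in computational pathology using representational similarity analysis. | contrastive learning, distillation, foundation model | |
| 13 | Self-supervised learning of imaging and clinical signatures using a multimodal joint-embedding predictive architecture | Uses self-supervised learning with a multimodal joint-embedding predictive architecture to improve lung nodule diagnosis. | predictive model, multimodal | |
| 14 | NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training | NeuroRAD-FM: a foundation model for neuro-oncology based on distributionally robust training. | MAE, foundation model | |
| 15 | Beyond Random Masking: A Dual-Stream Approach for Rotation-Invariant Point Cloud Masked Autoencoders | Proposes a dual-stream masked autoencoder that addresses the lack of geometric structure and semantic consistency in rotation-invariant point cloud learning. | masked autoencoder, MAE, curriculum learning | |
| 16 | Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception | Proposes AdaptiveNN, which achieves efficient and flexible machine visual perception by emulating human adaptive vision. | reinforcement learning, representation learning, embodied AI | |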

🔬 Pillar 3: Perception & Semantics (5 papers)

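| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation | VocAlign: a vocabulary alignment method for source-free domain adaptation in open-vocabulary semantic segmentation. | open-vocabulary, open vocabulary | |
| 18 | UCorr: Wire Detection and Depth Estimation for Autonomous Drones | Proposes UCorr for thin wire detection and depth estimation on autonomous drones. | depth estimation | |
| 19 | RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes | Proposes ROS-Cam, which optimizes camera parameters in dynamic scenes efficiently and accurately using only RGB video. | metric depth, NeRF | |
| 20 | Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model | Proposes an efficient, lightweight multi-view stereo method based on a confidence-aware diffusion model. | depth estimation | |
| 21 | SPATIALGEN: Layout-guided 3D Indoor Scene Generation | SpatialGen: a layout-guided 3D indoor scene generation model that addresses data scarcity and limited controllability. | scene understanding | |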

🔬 Pillar 1: Robot Control (2 papers)

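| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 22 | RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation | RynnVLA-001: uses human demonstrations to improve robot manipulation and proposes a two-stage pretrained VLA model. | manipulation, vision-language-action, VLA | |
| 23 | Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies | Uses YOLOv11 and domain randomization strategies for object detection transferred from synthetic data to real scenes. | domain randomization | |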

🔬 Pillar 7: Motion Retargeting (1 paper)

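| # | Title | One-line summary | Tags | 🔗 |
|---|---|---|---|---|
| 24 | SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters | SmolRGPT: an efficient spatial-reasoning vision-language model with 600M parameters for warehouse environments. | spatial relationship, multimodal | |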
