cs.CV (2025-09-18)

📊 24 papers in total | 🔗 5 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (9 🔗2) · Pillar 2: RL & Architecture (7 🔗1) · Pillar 3: Perception & Semantics (5 🔗1) · Pillar 1: Robot Control (2) · Pillar 7: Motion Retargeting (1 🔗1)

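🔬 Pillar 9: Embodied Foundation Models (9 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
1	How Good are Foundation Models in Step-by-Step Embodied Reasoning?	Proposes the FoMER benchmark to evaluate the step-by-step reasoning ability of foundation models in embodied environments	foundation model multimodal
2	Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding	Uses multimodal LLMs for zero-shot spatio-temporal video grounding and proposes the DSTH and TAS strategies	large language model multimodal	✅
3	From Pixels to Urban Policy-Intelligence: Recovering Legacy Effects of Redlining with a Multimodal LLM	From pixels to urban policy intelligence with a multimodal LLM: recovering the legacy effects of redlining	large language model multimodal
4	Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation	Proposes web toolkits for multimodal piano performance dataset acquisition and fingering annotation	multimodal
5	V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety & Driver Behaviour Modelling	V-SenseDrive: a privacy-preserving road video and in-vehicle sensor fusion framework for road safety and driver behaviour modelling	multimodal
6	ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models	ORCA: improves vision-language model performance on hallucination and adversarial robustness through agentic reasoning	multimodal
7	ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data	ScaleCUA: scales open-source computer use agents with cross-platform data	foundation model	✅
8	QuizRank: Picking Images by Quizzing VLMs	QuizRank: ranks images by quizzing vision-language models to improve the quality of images chosen for Wikipedia articles	large language model
9	Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification	Proposes the CMGR framework for 3D few-shot class-incremental learning via cross-modal geometric rectification	foundation model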

🔬 Pillar 2: RL & Architecture (7 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
10	Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation	Proposes a cross-modal distillation method for event-camera monocular depth estimation	distillation depth estimation monocular depth
11	Efficient Multimodal Dataset Distillation via Generative Models	Proposes EDGE, an efficient multimodal dataset distillation method based on generative models	distillation large language model multimodal
12	Comparing Computational Pathology Foundation Models using Representational Similarity Analysis	Compares multiple pretrained models in computational pathology using representational similarity analysis	contrastive learning distillation foundation model
13	Self-supervised learning of imaging and clinical signatures using a multimodal joint-embedding predictive architecture	Improves lung nodule diagnosis through self-supervised learning with a multimodal joint-embedding predictive architecture	predictive model multimodal
14	NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training	NeuroRAD-FM: a foundation model for neuro-oncology with distributionally robust training	MAE foundation model
15	Beyond Random Masking: A Dual-Stream Approach for Rotation-Invariant Point Cloud Masked Autoencoders	Proposes a dual-stream masked autoencoder that addresses the lack of geometric structure and semantic consistency in rotation-invariant point cloud learning	masked autoencoder MAE curriculum learning
16	Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception	Proposes AdaptiveNN, which emulates human adaptive vision for efficient and flexible machine visual perception	reinforcement learning representation learning embodied AI	✅

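🔬 Pillar 3: Perception & Semantics (5 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
17	Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation	VocAlign: a vocabulary alignment method for source-free domain adaptation in open-vocabulary semantic segmentation	open-vocabulary open vocabulary
18	UCorr: Wire Detection and Depth Estimation for Autonomous Drones	Proposes UCorr for thin-wire detection and depth estimation on autonomous drones	depth estimation
19	RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes	Proposes ROS-Cam, which efficiently and accurately optimizes camera parameters in dynamic scenes using only RGB video	metric depth NeRF
20	Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model	Proposes an efficient, lightweight multi-view stereo method based on a confidence-aware diffusion model	depth estimation	✅
21	SPATIALGEN: Layout-guided 3D Indoor Scene Generation	SpatialGen: a layout-guided 3D indoor scene generation model that addresses data scarcity and control difficulties	scene understanding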

🔬 Pillar 1: Robot Control (2 papers)

#	Title	One-Sentence Summary	Tags	🔗	⭐
22	RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation	RynnVLA-001: uses human demonstrations to improve robot manipulation and proposes a VLA model with two-stage pretraining	manipulation vision-language-action VLA
23	Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies	Synthetic-to-real object detection using YOLOv11 and domain randomization strategies	domain randomization

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-Sentence Summary	Tags	🔗	⭐
24	SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters	SmolRGPT: an efficient 600M-parameter vision-language model for spatial reasoning in warehouse environments	spatial relationship multimodal	✅
