cs.CV (2025-07-30)

📊 30 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗6) | Pillar 3: Spatial Perception & Semantics (7 🔗2) | Pillar 2: RL Algorithms & Architecture (6 🔗3) | Pillar 1: Robot Control (2) | Pillar 6: Video Extraction & Matching (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation	Proposes MObI and AnydoorMed for reference-image-guided multimodal diffusion inpainting and generation.	foundation model multimodal
2	A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding	Proposes the LLM4-IC8K framework, which uses large language models to understand integrated circuit footprint geometry.	large language model multimodal
3	Zero-Shot Image Anomaly Detection Using Generative Foundation Models	Uses generative pretrained models for zero-shot image anomaly detection.	foundation model
4	Universally Unfiltered and Unseen:Input-Agnostic Multimodal Jailbreaks against Text-to-Image Model Safeguards	Proposes U3-Attack, a universal, input-agnostic multimodal adversarial attack that bypasses the safeguards of text-to-image models.	multimodal
5	Gems: Group Emotion Profiling Through Multimodal Situational Understanding	GEMS: group emotion analysis through multimodal situational understanding.	multimodal	✅
6	DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception	DeltaVLM: interactive remote sensing image change analysis via instruction-guided difference perception.	large language model multimodal instruction following	✅
7	What is Beneath Misogyny: Misogynous Memes Classification and Explanation	Proposes the MM-Misogyny model for detecting, classifying, and explaining misogynous memes.	large language model multimodal	✅
8	Goal-Based Vision-Language Driving	NovaDrive: a single-branch autonomous driving architecture based on a vision-language model, improving safety and efficiency.	embodied AI
9	Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model	Proposes E-FineR, a vocabulary-free fine-grained image recognition method based on a contextually enriched vision-language model.	large language model	✅
10	Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation	Proposes the OmniAVS dataset and the OISA model for referring audio-visual segmentation with multimodal fusion.	multimodal
11	MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention	Proposes MoCHA to address the training and inference cost of vision-language models.	large language model
12	Advancing Fetal Ultrasound Image Quality Assessment in Low-Resource Settings	Proposes FetalCLIP$_{CLS}$, which uses a fetal ultrasound foundation model to improve image quality assessment in low-resource settings.	foundation model	✅
13	Segment Anything for Video: A Comprehensive Review of Video Object Segmentation and Tracking from Past to Future	Surveys SAM-based video object segmentation and tracking methods and outlines future trends.	foundation model
14	A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks	Proposes a linear N-point solver for structure and motion estimation from asynchronous tracks.	TAMP	✅

🔬 Pillar 3: Spatial Perception & Semantics (7 papers)

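#	Title	One-line summary	Tags	🔗	⭐
15	Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction	Proposes the REUrbanGS framework for robust, efficient city-scale 3D Gaussian reconstruction and real-time rendering.	3D gaussian splatting gaussian splatting splatting	✅
16	UFV-Splatter: Pose-Free Feed-Forward 3D Gaussian Splatting Adapted to Unfavorable Views	UFV-Splatter: a fast feed-forward 3D Gaussian splatting method for unfavorable views.	3D gaussian splatting 3DGS gaussian splatting
17	Details Matter for Indoor Open-vocabulary 3D Instance Segmentation	Proposes detail-focused enhancements for indoor open-vocabulary 3D instance segmentation, significantly improving performance.	open-vocabulary open vocabulary
18	Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields	Proposes PATA, an adaptive time-step training method for spike-based NeRF that improves rendering efficiency in resource-constrained settings.	NeRF neural radiance field
19	DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion	DepR: depth-guided single-view scene reconstruction incorporating instance-level diffusion models.	scene reconstruction
20	A Dual-Feature Extractor Framework for Accurate Back Depth and Spine Morphology Estimation from Monocular RGB Images	Proposes GAMA-Net, a dual-feature extractor framework for accurate spine morphology assessment from monocular RGB images.	depth estimation
21	Estimating 2D Camera Motion with Hybrid Motion Basis	CamFlow: estimates 2D camera motion using a hybrid motion basis, improving robustness in complex scenes.	optical flow	✅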

🔬 Pillar 2: RL Algorithms & Architecture (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
22	VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning	VL-Cogito: improves multimodal reasoning through progressive curriculum reinforcement learning.	reinforcement learning large language model multimodal
23	ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents	ScreenCoder: improves visual-to-code generation for front-end automation through modular multimodal agents.	reinforcement learning large language model multimodal	✅
24	LIDAR: Lightweight Adaptive Cue-Aware Fusion Vision Mamba for Multimodal Segmentation of Structural Cracks	Proposes LIDAR, a lightweight adaptive cue-aware Vision Mamba network for multimodal segmentation of structural cracks.	Mamba multimodal	✅
25	Bridging the Gap in Missing Modalities: Leveraging Knowledge Distillation and Style Matching for Brain Tumor Segmentation	MST-KDNet: uses knowledge distillation and style matching to tackle brain tumor segmentation with missing modalities.	distillation feature matching	✅
26	GVD: Guiding Video Diffusion Model for Scalable Video Distillation	Proposes GVD, a guided video diffusion model for scalable video dataset distillation.	distillation
27	MINR: Implicit Neural Representations with Masked Image Modelling	Proposes the MINR framework, which combines implicit neural representations with masked image modeling to improve the robustness and generalization of image reconstruction.	masked autoencoder MAE

🔬 Pillar 1: Robot Control (2 papers)

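#	Title	One-line summary	Tags	🔗	⭐
28	MRpro - open PyTorch-based MR reconstruction and processing package	MRpro: an open-source, PyTorch-based MR reconstruction and processing package that promotes research collaboration and reproducibility.	manipulation
29	Bi-Level Optimization for Self-Supervised AI-Generated Face Detection	Proposes a self-supervised AI-generated face detection method based on bi-level optimization, improving generalization to unseen generators.	manipulation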

🔬 Pillar 6: Video Extraction & Matching (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
30	Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques	Modality-aware feature matching survey: a comprehensive review of single- and cross-modality techniques.	feature matching
