cs.CV (2025-07-11)

📊 31 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

- Pillar 9: Embodied Foundation Models (13 papers, 🔗 3)
- Pillar 3: Spatial Perception & Semantics (8 papers, 🔗 1)
- Pillar 2: RL Algorithms & Architecture (5 papers, 🔗 2)
- Pillar 4: Generative Motion (2 papers, 🔗 2)
- Pillar 6: Video Extraction & Matching (1 paper)
- Pillar 1: Robot Control (1 paper)
- Pillar 8: Physics-based Animation (1 paper, 🔗 1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 1 | From Physics to Foundation Models: A Review of AI-Driven Quantitative Remote Sensing Inversion | Survey of AI-driven quantitative remote sensing inversion, from physical models to foundation models | foundation model, multimodal | |
| 2 | Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine | Great-X: an efficient multimodal ISAC data simulation platform built on Unreal Engine | foundation model, multimodal | |
| 3 | F3-Net: Foundation Model for Full Abnormality Segmentation of Medical Images with Flexible Input Modality Requirement | F3-Net: a foundation model for full-abnormality segmentation of medical images with flexible input-modality support | foundation model, multimodal | |
| 4 | Understanding Driving Risks using Large Language Models: Toward Elderly Driver Assessment | Uses large language models to understand driving risks, exploring their application to elderly driver assessment | large language model, multimodal | |
| 5 | Single Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement | Proposes SDIR and CADE modules to address single-domain generalization in multimodal cross-cancer prognosis | multimodal | |
| 6 | Raptor: Scalable Train-Free Embeddings for 3D Medical Volumes Leveraging Pretrained 2D Foundation Models | Raptor: leverages pretrained 2D foundation models to generate scalable, training-free embeddings for 3D medical volumes | foundation model | |
| 7 | Infinite Video Understanding | Introduces the concept of infinite video understanding, aiming to break the compute and memory bottlenecks of current models on unbounded-length video | large language model, multimodal | |
| 8 | Visual Semantic Description Generation with MLLMs for Image-Text Matching | Proposes an MLLM-based visual semantic description generation method that improves image-text matching | large language model, multimodal | |
| 9 | CNeuroMod-THINGS, a densely-sampled fMRI dataset for visual neuroscience | CNeuroMod-THINGS: a densely sampled fMRI dataset for visual neuroscience | multimodal | |
| 10 | From One to More: Contextual Part Latents for 3D Generation | Proposes the CoPart framework, enabling controllable 3D generation via contextual part latents | foundation model | |
| 11 | DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images | Proposes DatasetAgent, a multi-agent-system approach for automatically constructing datasets from real-world images | large language model | |
| 12 | A document is worth a structured record: Principled inductive bias design for document recognition | Proposes a structured-record-based approach to document recognition, improving accuracy and generalization on complex documents | foundation model | |
| 13 | Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models | Proposes MuGCP, enhancing vision-language model generalization via multimodal mutual-guidance conditional prompt learning | large language model | |

🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 14 | ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way | ByDeWay: a training-free depth-prompting framework that boosts multimodal LLM performance | depth estimation, monocular depth, large language model | |
| 15 | RePaintGS: Reference-Guided Gaussian Splatting for Realistic and View-Consistent 3D Scene Inpainting | Proposes RePaintGS, achieving realistic, view-consistent 3D scene inpainting via reference-guided Gaussian splatting | 3D gaussian splatting, gaussian splatting, splatting | |
| 16 | VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels | VISTA: a visual analytics framework for improving the quality of foundation-model-generated data labels | open-vocabulary, open vocabulary, foundation model | |
| 17 | MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion | MM-Gesture: precise micro-gesture recognition through multimodal fusion | optical flow, multimodal | |
| 18 | From images to properties: a NeRF-driven framework for granular material parameter inversion | Proposes a NeRF-driven framework for granular material parameter inversion, estimating material properties from visual observations | NeRF, neural radiance field | |
| 19 | PanMatch: Unleashing the Potential of Large Vision Models for Unified Matching Models | PanMatch: leverages large vision models to build a unified model for cross-domain matching tasks | optical flow, feature matching, foundation model | |
| 20 | Review of Feed-forward 3D Reconstruction: From DUSt3R to VGGT | Surveys feed-forward 3D reconstruction from DUSt3R to VGGT, covering 3D scene reconstruction in a single forward pass | VGGT | |
| 21 | One Graph to Track Them All: Dynamic GNNs for Single- and Multi-View Tracking | Proposes a unified multi-object tracking model based on dynamic GNNs, requiring no precomputed tracklets | scene reconstruction, spatiotemporal | |

🔬 Pillar 2: RL Algorithms & Architecture (5 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 22 | Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation for Occluded Person Re-Identification | Proposes OGFR, addressing feature contamination in occluded person re-identification via reinforced knowledge distillation | reinforcement learning, deep reinforcement learning, teacher-student | |
| 23 | VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models | Proposes an adversarial-attack-based visual information protection method for safeguarding private information in vision-language models | VIP, multimodal | |
| 24 | MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing | Proposes MoSAiC, improving remote sensing image representations via multi-modal multi-label supervision-aware contrastive learning | representation learning, contrastive learning | |
| 25 | SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2 | Proposes SAM2RL, using reinforcement learning to optimize SAM2's memory control and improve video object tracking | reinforcement learning | |
| 26 | Dual Dimensions Geometric Representation Learning Based Document Dewarping | Proposes D2Dewarp, a document image dewarping method based on dual-dimension geometric representation learning | representation learning | |

🔬 Pillar 4: Generative Motion (2 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 27 | Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation | Proposes VFMTok, using vision foundation models as image tokenizers to improve autoregressive image generation quality | classifier-free guidance, foundation model | |
| 28 | M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation | M2DAO-Talker: realistic talking-head generation via multi-granular motion decoupling and alternating optimization | penetration | |

🔬 Pillar 6: Video Extraction & Matching (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 29 | Video Inference for Human Mesh Recovery with Vision Transformer | Proposes HMR-ViT, leveraging temporal and kinematic information to improve human mesh recovery from video | human mesh recovery, HMR, SMPL | |

🔬 Pillar 1: Robot Control (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 30 | Taming generative video models for zero-shot optical flow extraction | Proposes KL-tracing, using generative video models for zero-shot optical flow extraction with performance rivaling specialized models | sim-to-real, world model, optical flow | |

🔬 Pillar 8: Physics-based Animation (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 31 | Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective | Lumos-1: a unified autoregressive video generation model, improving generation quality and efficiency | spatiotemporal, large language model, multimodal | |
