cs.CV (2025-07-11)

📊 31 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

- Pillar 9: Embodied Foundation Models (13 papers, 🔗 3)
- Pillar 3: Spatial Perception & Semantics (8 papers, 🔗 1)
- Pillar 2: RL Algorithms & Architecture (5 papers, 🔗 2)
- Pillar 4: Generative Motion (2 papers, 🔗 2)
- Pillar 6: Video Extraction & Matching (1 paper)
- Pillar 1: Robot Control (1 paper)
- Pillar 8: Physics-based Animation (1 paper, 🔗 1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 1 | From Physics to Foundation Models: A Review of AI-Driven Quantitative Remote Sensing Inversion | Survey of AI-driven quantitative remote sensing inversion, from physical models to foundation models | foundation model, multimodal | |
| 2 | Unreal is all you need: Multimodal ISAC Data Simulation with Only One Engine | Great-X: an efficient multimodal ISAC data simulation platform built on Unreal Engine | foundation model, multimodal | |
| 3 | F3-Net: Foundation Model for Full Abnormality Segmentation of Medical Images with Flexible Input Modality Requirement | F3-Net: a foundation model for full-abnormality segmentation of medical images with flexible input-modality support | foundation model, multimodal | |
| 4 | Understanding Driving Risks using Large Language Models: Toward Elderly Driver Assessment | Uses large language models to understand driving risks, exploring their application to elderly driver assessment | large language model, multimodal | |
| 5 | Single Domain Generalization for Multimodal Cross-Cancer Prognosis via Dirac Rebalancer and Distribution Entanglement | Proposes SDIR and CADE modules to address single-domain generalization in multimodal cross-cancer prognosis | multimodal | |
| 6 | Raptor: Scalable Train-Free Embeddings for 3D Medical Volumes Leveraging Pretrained 2D Foundation Models | Raptor: leverages pretrained 2D foundation models to generate scalable, training-free embeddings for 3D medical volumes | foundation model | |
| 7 | Infinite Video Understanding | Introduces the concept of infinite video understanding, aiming to break the compute and memory bottlenecks of current models on unbounded-length video | large language model, multimodal | |
| 8 | Visual Semantic Description Generation with MLLMs for Image-Text Matching | Proposes an MLLM-based visual semantic description generation method that improves image-text matching | large language model, multimodal | |
| 9 | CNeuroMod-THINGS, a densely-sampled fMRI dataset for visual neuroscience | CNeuroMod-THINGS: a densely sampled fMRI dataset for visual neuroscience | multimodal | |
| 10 | From One to More: Contextual Part Latents for 3D Generation | Proposes the CoPart framework, enabling controllable 3D generation via contextual part latents | foundation model | |
| 11 | DatasetAgent: A Novel Multi-Agent System for Auto-Constructing Datasets from Real-World Images | Proposes DatasetAgent, a multi-agent-system approach for automatically constructing datasets from real-world images | large language model | |
| 12 | A document is worth a structured record: Principled inductive bias design for document recognition | Proposes a structured-record-based approach to document recognition, improving accuracy and generalization on complex documents | foundation model | |
| 13 | Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models | Proposes MuGCP, enhancing vision-language model generalization via multimodal mutual-guidance conditional prompt learning | large language model | |

🔬 Pillar 3: Spatial Perception & Semantics (8 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 14 | ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way | ByDeWay: a training-free depth-prompting framework that boosts multimodal LLM performance | depth estimation, monocular depth, large language model | |
| 15 | RePaintGS: Reference-Guided Gaussian Splatting for Realistic and View-Consistent 3D Scene Inpainting | Proposes RePaintGS, achieving realistic, view-consistent 3D scene inpainting via reference-guided Gaussian splatting | 3D gaussian splatting, gaussian splatting, splatting | |
| 16 | VISTA: A Visual Analytics Framework to Enhance Foundation Model-Generated Data Labels | VISTA: a visual analytics framework for improving the quality of foundation-model-generated data labels | open-vocabulary, open vocabulary, foundation model | |
| 17 | MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion | MM-Gesture: precise micro-gesture recognition through multimodal fusion | optical flow, multimodal | |
| 18 | From images to properties: a NeRF-driven framework for granular material parameter inversion | Proposes a NeRF-driven framework for granular material parameter inversion, estimating material properties from visual observations | NeRF, neural radiance field | |
| 19 | PanMatch: Unleashing the Potential of Large Vision Models for Unified Matching Models | PanMatch: leverages large vision models to build a unified model for cross-domain matching tasks | optical flow, feature matching, foundation model | |
| 20 | Review of Feed-forward 3D Reconstruction: From DUSt3R to VGGT | Surveys feed-forward 3D reconstruction from DUSt3R to VGGT, covering 3D scene reconstruction in a single forward pass | VGGT | |
| 21 | One Graph to Track Them All: Dynamic GNNs for Single- and Multi-View Tracking | Proposes a unified multi-object tracking model based on dynamic GNNs, requiring no precomputed tracklets | scene reconstruction, spatiotemporal | |

🔬 Pillar 2: RL Algorithms & Architecture (5 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 22 | Occlusion-Guided Feature Purification Learning via Reinforced Knowledge Distillation for Occluded Person Re-Identification | Proposes OGFR, addressing feature contamination in occluded person re-identification via reinforced knowledge distillation | reinforcement learning, deep reinforcement learning, teacher-student | |
| 23 | VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models | Proposes an adversarial-attack-based visual information protection method for safeguarding private information in vision-language models | VIP, multimodal | |
| 24 | MoSAiC: Multi-Modal Multi-Label Supervision-Aware Contrastive Learning for Remote Sensing | Proposes MoSAiC, improving remote sensing image representations via multi-modal multi-label supervision-aware contrastive learning | representation learning, contrastive learning | |
| 25 | SAM2RL: Towards Reinforcement Learning Memory Control in Segment Anything Model 2 | Proposes SAM2RL, using reinforcement learning to optimize SAM2's memory control and improve video object tracking | reinforcement learning | |
| 26 | Dual Dimensions Geometric Representation Learning Based Document Dewarping | Proposes D2Dewarp, a document image dewarping method based on dual-dimension geometric representation learning | representation learning | |

🔬 Pillar 4: Generative Motion (2 papers)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 27 | Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation | Proposes VFMTok, using vision foundation models as image tokenizers to improve autoregressive image generation quality | classifier-free guidance, foundation model | |
| 28 | M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation | M2DAO-Talker: realistic talking-head generation via multi-granular motion decoupling and alternating optimization | penetration | |

🔬 Pillar 6: Video Extraction & Matching (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 29 | Video Inference for Human Mesh Recovery with Vision Transformer | Proposes HMR-ViT, leveraging temporal and kinematic information to improve human mesh recovery from video | human mesh recovery, HMR, SMPL | |

🔬 Pillar 1: Robot Control (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 30 | Taming generative video models for zero-shot optical flow extraction | Proposes KL-tracing, using generative video models for zero-shot optical flow extraction with performance rivaling specialized models | sim-to-real, world model, optical flow | |

🔬 Pillar 8: Physics-based Animation (1 paper)

| # | Title | One-line Takeaway | Tags | 🔗 |
|---|-------|-------------------|------|----|
| 31 | Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective | Lumos-1: a unified autoregressive video generation model, improving generation quality and efficiency | spatiotemporal, large language model, multimodal | |
