cs.CV (2025-07-24)

📊 31 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗5) · Pillar 2: RL & Architecture (9 🔗2) · Pillar 3: Perception & Semantics (8 🔗1) · Pillar 6: Video Extraction (3) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

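# | Title | Key point | Tags | 🔗
1 | LMM-Det: Make Large Multimodal Models Excel in Object Detection | Proposes LMM-Det, which uses a large multimodal model to perform object detection without a dedicated detection module. | multimodal, visual grounding |
2 | GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures | GRR-CoCa: improves multimodal model architecture performance by incorporating LLM mechanisms. | large language model, multimodal |
3 | Diffusion-FS: Multimodal Free-Space Prediction via Diffusion for Autonomous Driving | Proposes Diffusion-FS, which performs multimodal free-space prediction with a diffusion model for autonomous driving. | multimodal |
4 | Explaining How Visual, Textual and Multimodal Encoders Share Concepts | Proposes a metric of cross-modal concept sharing for comparing the feature representations of visual, textual, and multimodal encoders. | multimodal |
5 | PDB-Eval: An Evaluation of Large Multimodal Models for Description and Explanation of Personalized Driving Behavior | Proposes the PDB-Eval benchmark for evaluating how well large multimodal models understand and explain personalized driving behavior. | multimodal |
6 | A Multimodal Seq2Seq Transformer for Predicting Brain Responses to Naturalistic Stimuli | Proposes a multimodal Seq2Seq Transformer for predicting brain fMRI responses to naturalistic stimuli. | multimodal |
7 | Captain Cinema: Towards Short Movie Generation | Captain Cinema: proposes a short-movie generation framework that addresses long-range consistency and high-quality generation. | multimodal |
8 | VideoMind: An Omni-Modal Video Dataset with Intent Grounding for Deep-Cognitive Video Understanding | VideoMind: an omni-modal video dataset with intent grounding for deep-cognitive video understanding. | chain-of-thought |
9 | IntentVCNet: Bridging Spatio-Temporal Gaps for Intention-Oriented Controllable Video Captioning | IntentVCNet: achieves intention-oriented controllable video captioning by bridging spatio-temporal gaps. | instruction following |
10 | Towards Effective Human-in-the-Loop Assistive AI Agents | Proposes a human-AI collaboration evaluation framework and an AR-based assistive AI agent to improve performance on physical tasks. | multimodal |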

🔬 Pillar 2: RL & Architecture (9 papers)

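# | Title | Key point | Tags | 🔗
11 | TeEFusion: Blending Text Embeddings to Distill Classifier-Free Guidance | TeEFusion: blends text embeddings to distill classifier-free guidance, accelerating text-to-image generation. | distillation, classifier-free guidance |
12 | DiagR1: A Vision-Language Model Trained via Reinforcement Learning for Digestive Pathology Diagnosis | DiagR1: a vision-language model for digestive pathology diagnosis trained via reinforcement learning. | reinforcement learning, multimodal |
13 | HybridTM: Combining Transformer and Mamba for 3D Semantic Segmentation | Proposes HybridTM, which combines Transformer and Mamba for efficient 3D semantic segmentation. | Mamba |
14 | Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis | Proposes DMDX, which improves the efficiency of diffusion-model image and video synthesis through adversarial distribution matching distillation. | distillation |
15 | Exploiting Gaussian Agnostic Representation Learning with Diffusion Priors for Enhanced Infrared Small Target Detection | Proposes an infrared small target detection method based on Gaussian-agnostic representation learning and diffusion priors. | representation learning |
16 | MatSSL: Robust Self-Supervised Representation Learning for Metallographic Image Segmentation | MatSSL: a robust self-supervised representation learning method for metallographic image segmentation. | representation learning |
17 | Unsupervised Domain Adaptation for 3D LiDAR Semantic Segmentation Using Contrastive Learning and Multi-Model Pseudo Labeling | Proposes an unsupervised domain adaptation method for LiDAR semantic segmentation based on contrastive learning and multi-model pseudo labeling. | contrastive learning |
18 | WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection | WaveMamba: an RGB-infrared object detection method based on wavelet transforms and Mamba fusion. | Mamba |
19 | Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning | Proposes a two-stage training framework based on reinforcement learning that improves the accuracy and generalization of video temporal grounding. | reinforcement learning |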

🔬 Pillar 3: Perception & Semantics (8 papers)

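# | Title | Key point | Tags | 🔗
20 | DepthDark: Robust Monocular Depth Estimation for Low-Light Environments | DepthDark: robust monocular depth estimation for low-light environments. | depth estimation, monocular depth, foundation model |
21 | SaLF: Sparse Local Fields for Multi-Sensor Rendering in Real-Time | Proposes SaLF, a sparse local field representation that supports real-time multi-sensor rendering. | 3D gaussian splatting, 3DGS, gaussian splatting |
22 | Unposed 3DGS Reconstruction with Probabilistic Procrustes Mapping | Proposes an unposed 3DGS reconstruction framework based on probabilistic Procrustes mapping, targeting large-scale scene reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting |
23 | CRUISE: Cooperative Reconstruction and Editing in V2X Scenarios using Gaussian Splatting | CRUISE: a Gaussian splatting-based framework for cooperative reconstruction and editing in V2X scenarios. | gaussian splatting, splatting |
24 | MVG4D: Image Matrix-Based Multi-View and Motion Generation for 4D Content Creation from a Single Image | MVG4D: image matrix-based multi-view and motion generation for 4D content creation from a single image. | gaussian splatting, splatting, motion generation |
25 | Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting | Proposes a scalable 2D-to-3D data lifting pipeline that addresses the scarcity of 3D data and advances spatial intelligence. | depth estimation, scene understanding |
26 | LONG3R: Long Sequence Streaming 3D Reconstruction | Proposes LONG3R for long-sequence streaming 3D reconstruction. | scene reconstruction |
27 | BokehDiff: Neural Lens Blur with One-Step Diffusion | BokehDiff: realistic neural lens blur rendering with a one-step diffusion model. | depth estimation |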

🔬 Pillar 6: Video Extraction (3 papers)

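# | Title | Key point | Tags | 🔗
28 | Object segmentation in the wild with foundation models: application to vision assisted neuro-prostheses for upper limbs | Uses foundation models with eye-gaze prompts to improve object segmentation for neuro-prostheses in complex scenes. | egocentric, foundation model |
29 | EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs | EgoExoBench: the first benchmark for first- and third-person view video understanding in multimodal large language models. | egocentric, large language model, multimodal |
30 | Learning Efficient and Generalizable Human Representation with Human Gaussian Model | Proposes Human Gaussian Graph for efficiently generating animatable human Gaussian models. | SMPL, TAMP |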

🔬 Pillar 7: Motion Retargeting (1 paper)

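# | Title | Key point | Tags | 🔗
31 | Towards Consistent Long-Term Pose Generation | Proposes a single-stage pose generation method that addresses temporal consistency in long-term pose generation. | spatial relationship |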
