cs.CV (2025-10-14)

📊 45 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗1) · Pillar 2: RL Algorithms and Architecture (13 🔗5) · Pillar 3: Spatial Perception and Semantics (9 🔗2) · Pillar 4: Generative Motion (5 🔗2) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 1: Robot Control (1) · Pillar 6: Video Extraction and Matching (1 🔗1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

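# | Title | One-sentence summary | Tags
1 | SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis | SpineBench: a benchmark of multimodal LLMs for spinal pathology analysis | large language model, multimodal
2 | SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models | Proposes SRUM, a fine-grained self-rewarding framework for unified multimodal models | multimodal
3 | Personalized Federated Fine-Tuning of Vision Foundation Models for Healthcare | Proposes a personalized federated fine-tuning method for vision foundation models in healthcare | foundation model
4 | CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection | CrossRay3D: efficient multimodal 3D detection guided by geometry and distribution | multimodal
5 | ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution | ViCO: a training strategy for semantic-aware, dynamic high-resolution multimodal large models | large language model, multimodal
6 | VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage | Proposes VQArt-Bench, a semantically rich VQA benchmark for art and cultural heritage | large language model, multimodal
7 | IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation | IL3D: a large-scale indoor layout dataset for LLM-driven 3D scene generation | large language model, multimodal
8 | Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space | Proposes IVT-LR, which performs interleaved vision-text reasoning in latent space to improve multimodal LLM efficiency | multimodal
9 | Unlocking Zero-Shot Plant Segmentation with Pl@ntNet Intelligence | Uses Pl@ntNet knowledge to achieve zero-shot plant segmentation in agricultural images | foundation model
10 | VideoLucy: Deep Memory Backtracking for Long Video Understanding | VideoLucy: proposes a deep memory backtracking framework for long video understanding, significantly improving performance | large language model
11 | Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector | Vision-language models map logos to text through semantic entanglement in the visual projector, which makes them prone to hallucination | multimodal
12 | HoneyBee: Data Recipes for Vision-Language Reasoners | HoneyBee: data recipes for vision-language reasoners that improve model performance | chain-of-thought
13 | MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites | Proposes CapFlow, a multi-agent collaboration workflow that, combined with MetaCaptioner, achieves general visual captioning comparable to GPT-4.1 | multimodal
14 | MultiFoodChat: A potential new paradigm for intelligent food quality inspection | Proposes MultiFoodChat, a dialogue-driven multi-agent reasoning framework for zero-shot food recognition | large language model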

🔬 Pillar 2: RL Algorithms and Architecture (13 papers)

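# | Title | One-sentence summary | Tags
15 | Learning Human Motion with Temporally Conditional Mamba | Proposes a temporally conditional Mamba model that improves alignment and realism in temporal human motion generation tasks | Mamba, motion generation, human motion
16 | On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation | Uses hierarchical vision foundation models for low-cost human mesh recovery and pose estimation | Mamba, human mesh recovery, HMR
17 | CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs | Proposes CompoDistill, which improves the compositional reasoning of multimodal LLMs through attention distillation | distillation, large language model, multimodal
18 | DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search | Proposes DeepMMSearch-R1 to address information retrieval by multimodal LLMs in web search | reinforcement learning, large language model, multimodal
19 | SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model | SAIL-Embedding: a general-purpose multimodal embedding foundation model for real-world scenarios | representation learning, foundation model, multimodal
20 | CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving | Proposes CoIRL-AD, a competitive imitation-reinforcement learning framework for autonomous driving | reinforcement learning, imitation learning, world model
21 | Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation | Proposes FetalMind for fetal ultrasound report generation and diagnosis, improving multi-view reasoning and disease recognition | reinforcement learning, foundation model
22 | DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving | DriveVLA-W0: uses world models to amplify the data scaling law in autonomous driving | world model, vision-language-action, VLA
23 | CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion | CurriFlow: depth fusion based on optical-flow temporal alignment and curriculum learning, for 3D semantic scene completion | curriculum learning, stereo depth, optical flow
24 | DRL: Discriminative Representation Learning with Parallel Adapters for Class Incremental Learning | Proposes the DRL framework, which uses parallel adapters and decoupled anchor supervision to address representation shift and inconsistency in class-incremental learning | DRL, representation learning
25 | One Dimensional CNN ECG Mamba for Multilabel Abnormality Classification in 12 Lead ECG | Proposes a 1D CNN ECG Mamba model for multilabel abnormality classification in 12-lead ECG, significantly improving AUPRC and AUROC | Mamba, state space model
26 | Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval | Proposes a dual learning framework based on dynamic knowledge distillation and soft alignment for partially relevant video retrieval | distillation
27 | State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding | Proposes State Space Prompting (SSP), which improves video understanding by gathering and spreading spatio-temporal information | state space model, spatiotemporal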

🔬 Pillar 3: Spatial Perception and Semantics (9 papers)

# | Title | One-sentence summary | Tags
28 | UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal Rendering | Proposes UniGS to address high-fidelity multimodal 3D reconstruction | 3D gaussian splatting, gaussian splatting, splatting
29 | BSGS: Bi-stage 3D Gaussian Splatting for Camera Motion Deblurring | Proposes bi-stage 3D Gaussian Splatting (BSGS) to address 3D reconstruction of scenes with camera motion blur | 3D gaussian splatting, 3DGS, gaussian splatting
30 | DrivingScene: A Multi-Task Online Feed-Forward 3D Gaussian Splatting Method for Dynamic Driving Scenes | Proposes DrivingScene, an online feed-forward 3D Gaussian Splatting method for dynamic driving scenes | 3D gaussian splatting, gaussian splatting, splatting
31 | G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior | G4Splat: high-quality Gaussian Splatting reconstruction using a generative prior and geometric guidance | gaussian splatting, splatting, scene reconstruction
32 | Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction | Proposes USplat4D, which improves monocular dynamic Gaussian Splatting 4D reconstruction through uncertainty modeling | gaussian splatting, splatting
33 | Hybrid Gaussian Splatting for Novel Urban View Synthesis | Proposes a hybrid Gaussian Splatting method for novel view synthesis of urban street scenes | gaussian splatting, splatting
34 | PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes | Proposes PAGS, a priority-adaptive Gaussian Splatting reconstruction method for dynamic driving scenes | gaussian splatting, splatting
35 | E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization | E-MoFlow: learns egomotion and optical flow from event data via implicit regularization | depth estimation, optical flow, geometric consistency
36 | SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding | SPORTS: simultaneous panoptic odometry, rendering, tracking, and segmentation for urban scene understanding | visual odometry, scene understanding, optical flow

🔬 Pillar 4: Generative Motion (5 papers)

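# | Title | One-sentence summary | Tags
37 | SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion | SceneAdapt: proposes a scene-aware adaptation framework for human motion diffusion models | motion diffusion, text-to-motion, motion generation
38 | Unconditional Human Motion and Shape Generation via Balanced Score-Based Diffusion | Unconditional human motion and shape generation via a balanced score-based diffusion model | motion generation, human motion, human motion generation
39 | What If: Understanding Motion Through Sparse Interactions | Proposes the Flow Poke Transformer, which models the distribution of scene motion through sparse interactions | motion generation, motion estimation
40 | LayerSync: Self-aligning Intermediate Layers | LayerSync: proposes a diffusion model training method that self-aligns intermediate layers, improving generation quality and training efficiency | motion generation
41 | Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback | Playmate2: training-free multi-character audio-driven animation based on a Diffusion Transformer and reward feedback | classifier-free guidance, character animation, foundation model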

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags
42 | PET Head Motion Estimation Using Supervised Deep Learning with Attention | Proposes DL-HMC++, an attention-based deep learning method for PET head motion estimation and correction | motion estimation, motion tracking

🔬 Pillar 1: Robot Control (1 paper)

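# | Title | One-sentence summary | Tags
43 | Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning | Proposes VisualToolBench to evaluate multimodal LLMs on tool-assisted image perception, transformation, and reasoning | manipulation, large language model, multimodal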

🔬 Pillar 6: Video Extraction and Matching (1 paper)

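# | Title | One-sentence summary | Tags
44 | Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling | Proposes STORM-PSR, which uses spatio-temporal modeling to make procedure step recognition in egocentric assembly videos more robust | egocentric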

🔬 Pillar 8: Physics-based Animation (1 paper)

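# | Title | One-sentence summary | Tags
45 | Hardware-aware Coding Function Design for Compressive Single-Photon 3D Cameras | Proposes a hardware-aware coding function design method for the hardware constraints of single-photon 3D cameras | PULSE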
