cs.CV (2025-11-21)

📊 58 papers in total | 🔗 11 with code

🎯 Browse by interest area

Pillar 9: Embodied Foundation Models (18 🔗2) · Pillar 2: RL & Architecture (17 🔗4) · Pillar 3: Perception & Semantics (10 🔗2) · Pillar 1: Robot Control (5 🔗1) · Pillar 3: Perception & SLAM (3) · Pillar 7: Motion Retargeting (2) · Pillar 5: Interaction & Reaction (1 🔗1) · Pillar 6: Video Extraction (1) · Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (18 papers)

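| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 1 | VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning | VisReason: a large-scale dataset for visual chain-of-thought reasoning that improves the reasoning ability of multimodal large language models. | large language model, multimodal, chain-of-thought |
| 2 | Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models | Proposes the Extract+Think method, improving the efficiency and performance of small multimodal models on perception and reasoning. | large language model, multimodal |
| 3 | REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing | REMSA: an LLM-based agent for automatic foundation model selection in remote sensing. | foundation model, multimodal |
| 4 | SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion | SpatialGeo: enhances the spatial reasoning of multimodal LLMs through geometry-semantics fusion. | large language model, multimodal |
| 5 | PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning | PathAgent: interpretable analysis of pathology slide images with a large language model-based agent. | large language model, chain-of-thought |
| 6 | Pillar-0: A New Frontier for Radiology Foundation Models | Pillar-0: builds a large-scale radiology imaging foundation model to improve clinical diagnostic performance. | foundation model |
| 7 | Attention Guided Alignment in Efficient Vision-Language Models | Proposes AGE-VLM, which improves efficient vision-language models and reduces hallucination through attention-guided alignment. | large language model, multimodal, visual grounding |
| 8 | Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning | Proposes CA-SAM, which improves SAM's continual learning ability in medical image segmentation through a continual alignment strategy. | foundation model |
| 9 | Navigating in the Dark: A Multimodal Framework and Dataset for Nighttime Traffic Sign Recognition | Proposes LENS-Net and the INTSD dataset to address nighttime traffic sign recognition. | multimodal |
| 10 | ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better | ChainV: shortens and improves multimodal reasoning with atomic visual hints. | multimodal |
| 11 | Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models | Proposes the VMRMOT framework, which uses multimodal large language models to address missing dynamic information in referring multi-object tracking. | large language model |
| 12 | UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation | UniModel: proposes a unified visual framework for multimodal understanding and generation. | multimodal |
| 13 | OmniPT: Unleashing the Potential of Large Vision Language Models for Pedestrian Tracking and Understanding | OmniPT: a unified framework for pedestrian tracking and semantic understanding with large vision-language models. | foundation model, visual grounding |
| 14 | OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios | OmniGround: proposes a comprehensive spatio-temporal grounding benchmark for real-world complex scenarios. | large language model, multimodal |
| 15 | Understanding Counting Mechanisms in Large Language and Vision-Language Models | Analyzes counting mechanisms in LLMs and LVLMs through controlled experiments and mechanistic interpretability. | large language model |
| 16 | MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis | MatPedia: a universal generative foundation model for high-fidelity material synthesis. | foundation model |
| 17 | FingerCap: Fine-grained Finger-level Hand Motion Captioning | Proposes FingerCap for generating fine-grained, finger-level motion descriptions, and builds a large-scale dataset. | multimodal |
| 18 | MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models | MultiPriv: the first benchmark for evaluating individual-level privacy reasoning in vision-language models. | multimodal |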

🔬 Pillar 2: RL & Architecture (17 papers)

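| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 19 | Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift | Proposes a multimodal distillation method that improves 3D semantic segmentation under domain shift. | distillation, foundation model, multimodal |
| 20 | UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification | Proposes UAM: a unified multimodal attention-Mamba backbone for tumor cell classification. | Mamba, foundation model, multimodal |
| 21 | MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models | Proposes MMT-ARD, which improves the robustness of vision-language models through multimodal multi-teacher adversarial distillation. | distillation, multimodal |
| 22 | FireScope: Wildfire Risk Prediction with a Chain-of-Thought Oracle | Proposes FireScope, which predicts wildfire risk with a chain-of-thought oracle, improving cross-continent generalization and interpretability. | reinforcement learning, multimodal, chain-of-thought |
| 23 | Toward explainable AI approaches for breast imaging: adapting foundation models to diverse populations | Explainable AI methods for breast imaging across diverse populations, built on BiomedCLIP. | contrastive learning, foundation model |
| 24 | MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment | Proposes the MCMoE model, which completes missing modalities with a mixture of experts to improve incomplete multimodal action quality assessment. | representation learning, multimodal |
| 25 | Counterfactual World Models via Digital Twin-conditioned Video Diffusion | Proposes CWMDT, which achieves counterfactual world modeling through digital twins and video diffusion models. | world model, large language model |
| 26 | MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning | MolSight: an optical chemical structure recognition method combining SMILES pretraining, multi-granularity learning, and reinforcement learning. | reinforcement learning, large language model |
| 27 | RL-AD-Net: Reinforcement Learning Guided Adaptive Displacement in Latent Space for Refined Point Cloud Completion | Proposes RL-AD-Net, which uses reinforcement learning to guide adaptive displacement in latent space, improving local geometric consistency in point cloud completion. | reinforcement learning, geometric consistency |
| 28 | R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios | Proposes the R-AVST dataset and the AVST-Zero model to strengthen the spatio-temporal reasoning of video LLMs in complex audio-visual scenarios. | reinforcement learning, large language model, multimodal |
| 29 | Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers | SPECTRE: self-supervised and cross-modal pretraining for volumetric CT Transformers. | contrastive learning, distillation, foundation model |
| 30 | Target-Bench: Can World Models Achieve Mapless Path Planning with Semantic Targets? | Target-Bench: evaluates the mapless path planning ability of world models given semantic targets. | world model |
| 31 | Importance-Weighted Non-IID Sampling for Flow Matching Models | Proposes importance-weighted non-IID sampling, improving the accuracy of expectation estimates over Flow Matching model outputs. | flow matching |
| 32 | Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination | Proposes Video-R4, which strengthens text-rich video reasoning through visual rumination. | reinforcement learning, multimodal |
| 33 | DReX: Pure Vision Fusion of Self-Supervised and Convolutional Representations for Image Complexity Prediction | DReX: a pure-vision image complexity prediction model that fuses self-supervised and convolutional representations. | MAE, multimodal |
| 34 | Neighbor GRPO: Contrastive ODE Policy Optimization Aligns Flow Models | Proposes Neighbor GRPO, which optimizes flow models through contrastive learning, improving generation quality and efficiency. | flow matching, contrastive learning |
| 35 | Parts-Mamba: Augmenting Joint Context with Part-Level Scanning for Occluded Human Skeleton | Proposes the Parts-Mamba model, which improves skeleton-based action recognition under occlusion. | Mamba |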

🔬 Pillar 3: Perception & Semantics (10 papers)

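| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 36 | PEGS: Physics-Event Enhanced Large Spatiotemporal Motion Reconstruction via 3D Gaussian Splatting | PEGS: large spatiotemporal motion reconstruction via physics-event enhanced 3D Gaussian Splatting. | 3D gaussian splatting, gaussian splatting, splatting |
| 37 | Gradient-Driven Natural Selection for Compact 3D Gaussian Splatting | Proposes a gradient-driven natural selection method for compact 3D Gaussian Splatting, improving rendering quality. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 38 | NoPe-NeRF++: Local-to-Global Optimization of NeRF with No Pose Prior | NoPe-NeRF++: local-to-global optimization of NeRF without pose priors. | NeRF, neural radiance field, feature matching |
| 39 | SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting | Proposes SPAGS, which reconstructs articulated objects from sparse views of a single state via planar Gaussian Splatting. | gaussian splatting, splatting |
| 40 | REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting | REArtGS++: generalizable articulated object reconstruction based on planar Gaussian Splatting and temporal geometry constraints. | gaussian splatting, splatting |
| 41 | CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation | Proposes the CORA framework, which uses consistency-guided semi-supervised learning for reasoning segmentation. | scene understanding, multimodal, instruction following |
| 42 | DepthFocus: Controllable Depth Estimation for See-Through Scenes | Proposes DepthFocus, which enables selective perception of see-through scenes through controllable depth estimation. | depth estimation, stereo depth |
| 43 | AEGIS: Preserving privacy of 3D Facial Avatars with Adversarial Perturbations | AEGIS: protects the privacy of 3D facial avatars with adversarial perturbations. | 3D gaussian splatting, gaussian splatting, splatting |
| 44 | FisheyeGaussianLift: BEV Feature Lifting for Surround-View Fisheye Camera Perception | Proposes FisheyeGaussianLift to address distortion and uncertainty in fisheye-camera BEV semantic segmentation. | splatting, semantic map |
| 45 | SuperQuadricOcc: Multi-Layer Gaussian Approximation of Superquadrics for Real-Time Self-Supervised Occupancy Estimation | SuperQuadricOcc: real-time self-supervised occupancy estimation based on a multi-layer Gaussian approximation of superquadrics. | scene understanding |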

🔬 Pillar 1: Robot Control (5 papers)

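| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 46 | VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation | Proposes VLA-4D to achieve spatiotemporally coherent robotic manipulation. | manipulation, spatiotemporal, vision-language-action |
| 47 | Planning with Sketch-Guided Verification for Physics-Aware Video Generation | Proposes the SketchVerify framework, which improves motion planning quality for physics-aware video generation through sketch-guided verification. | motion planning, world model, physically plausible |
| 48 | QAL: A Loss for Recall Precision Balance in 3D Reconstruction | Proposes QAL to balance recall and precision in 3D reconstruction. | manipulation |
| 49 | Show Me: Unifying Instructional Image and Video Generation with Diffusion Models | ShowMe: unifies instructional image and video generation with diffusion models. | manipulation |
| 50 | PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention | PostCam: camera-controllable novel-view video generation based on query-shared cross-attention. | manipulation |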

🔬 Pillar 3: Perception & SLAM (3 papers)

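| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 51 | Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction | Proposes an optical-flow-guided implicit neural representation for motion-aware dynamic MRI reconstruction. | optical flow |
| 52 | DeltaDeno: Zero-Shot Anomaly Generation via Delta-Denoising Attribution | DeltaDeno: proposes a zero-shot anomaly generation method based on delta-denoising attribution. | localization |
| 53 | Glass Surface Detection: Leveraging Reflection Dynamics in Flash/No-flash Imagery | Proposes NFGlassNet, which detects glass surfaces using reflection dynamics in flash/no-flash images. | localization |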

🔬 Pillar 7: Motion Retargeting (2 papers)

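| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 54 | The Potential and Limitations of Vision-Language Models for Human Motion Understanding: A Case Study in Data-Driven Stroke Rehabilitation | A study of the potential and limitations of vision-language models for data-driven stroke rehabilitation analysis. | human motion |
| 55 | SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors | Proposes SING3R-SLAM to address geometric consistency in indoor monocular SLAM. | geometric consistency |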

🔬 Pillar 5: Interaction & Reaction (1 paper)

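| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 56 | RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis | RacketVision: a multi-racket-sport benchmark dataset for unified ball and racket analysis. | human-object interaction, multimodal |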

🔬 Pillar 6: Video Extraction (1 paper)

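| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 57 | SPIDER: Spatial Image CorresponDence Estimator for Robust Calibration | SPIDER: a spatial image correspondence estimator for robust calibration that improves cross-domain image matching. | feature matching, foundation model |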

🔬 Pillar 4: Generative Motion (1 paper)

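| # | Title | Key point | Tags |
| --- | --- | --- | --- |
| 58 | OmniLens++: Blind Lens Aberration Correction via Large LensLib Pre-Training and Latent PSF Representation | OmniLens++: blind lens aberration correction based on large-scale LensLib pretraining and a latent PSF representation. | VQ-VAE |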
