cs.CV (2026-03-13)

📊 46 papers in total | 🔗 16 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (14 🔗5) · Pillar 2: RL & Architecture (12 🔗4) · Pillar 3: Perception & Semantics (8 🔗2) · Pillar 1: Robot Control (4 🔗1) · Pillar 6: Video Extraction (4 🔗2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 7: Motion Retargeting (2 🔗1)

🔬 Pillar 9: Embodied Foundation Models (14 papers)

| # | Title | One-line summary | Tags |
|---|---|---|---|
| 1 | Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World | Proposes Dyn-Bench, a benchmark for evaluating how multimodal large language models perceive, track, and reason about dynamics in the physical 4D world. | large language model, multimodal, chain-of-thought |
| 2 | Towards Faithful Multimodal Concept Bottleneck Models | Proposes f-CBM, a faithful multimodal concept bottleneck model that improves concept detection and reduces information leakage. | multimodal |
| 3 | Mitigating Memorization in Text-to-Image Diffusion via Region-Aware Prompt Augmentation and Multimodal Copy Detection | Proposes region-aware prompt augmentation and multimodal copy detection to mitigate memorization in text-to-image diffusion models. | multimodal |
| 4 | Multimodal OCR: Parse Anything from Documents | Proposes Multimodal OCR, which unifies parsing of textual and graphical elements in documents. | multimodal |
| 5 | Multimodal Protein Language Models for Enzyme Kinetic Parameters: From Substrate Recognition to Conformational Adaptation | Proposes ERBA, which predicts enzyme kinetic parameters via a multimodal protein language model, improving prediction of enzyme-substrate binding efficiency. | multimodal |
| 6 | HIFICL: High-Fidelity In-Context Learning for Multimodal Tasks | Proposes HIFICL, which improves multimodal task performance through high-fidelity in-context learning. | multimodal |
| 7 | UNIStainNet: Foundation-Model-Guided Virtual Staining of H&E to IHC | UNIStainNet: uses a pathology foundation model to guide virtual staining of H&E images into IHC, enabling unified multi-marker modeling. | foundation model |
| 8 | Prompt-Driven Lightweight Foundation Model for Instance Segmentation-Based Fault Detection in Freight Trains | Proposes a self-prompting, lightweight-foundation-model-based instance segmentation framework for fault detection in freight trains. | foundation model |
| 9 | Geometry-Guided Camera Motion Understanding in VideoLLMs | Proposes the CameraMotionVQA benchmark and a geometry-guided injection method to improve VideoLLMs' understanding of camera motion. | VLA, foundation model |
| 10 | Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning | Builds a generalized surgical action recognition model, enabling skill assessment and vision-language-model-based surgical planning. | large language model, foundation model |
| 11 | Test-Time Attention Purification for Backdoored Large Vision Language Models | Proposes CleanSight, a test-time attention purification defense against backdoored large vision-language models. | multimodal |
| 12 | SAP: Segment Any 4K Panorama | Proposes SAP to address instance segmentation in 360° panoramic images. | foundation model |
| 13 | A2Z-10M+: Geometric Deep Learning with A-to-Z BRep Annotations for AI-Assisted CAD Modeling and Reverse Engineering | A2Z-10M+: geometric deep learning with A-to-Z BRep annotations, supporting AI-assisted CAD modeling and reverse engineering. | foundation model |
| 14 | Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation | Proposes SERA, a spatio-semantic expert routing architecture for referring image segmentation. | multimodal |

🔬 Pillar 2: RL & Architecture (12 papers)

| # | Title | One-line summary | Tags |
|---|---|---|---|
| 15 | TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation | TerraFlow: a multimodal, multitemporal representation learning method for Earth observation. | representation learning, foundation model, multimodal |
| 16 | VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model | VGGT-World: a geometry world model based on autoregressive prediction of geometric features, improving depth-prediction efficiency. | flow matching, world model, VGGT |
| 17 | Team RAS in 10th ABAW Competition: Multimodal Valence and Arousal Estimation Approach | Team RAS proposes a multimodal fusion approach for continuous valence and arousal estimation in the wild. | Mamba, multimodal |
| 18 | Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach | Team LEYA proposes a multimodal fusion approach for ambivalence/hesitancy recognition in unconstrained videos. | Mamba, multimodal |
| 19 | GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification | Proposes the GLEAM multimodal glaucoma dataset and the HAMM model for glaucoma stage classification. | representation learning, multimodal |
| 20 | Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation | Cheers: decouples image patch details from semantic representations, enabling unified multimodal comprehension and generation. | flow matching, multimodal |
| 21 | Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing | Proposes the ME-RSRG benchmark dataset and the EAR framework for multi-entity reasoning and visual grounding in remote sensing imagery. | reinforcement learning, foundation model, visual grounding |
| 22 | CMHANet: A Cross-Modal Hybrid Attention Network for Point Cloud Registration | CMHANet: a cross-modal hybrid attention network for point cloud registration, improving robustness in complex scenes. | contrastive learning, scene understanding, geometric consistency |
| 23 | Visual-ERM: Reward Modeling for Visual Equivalence | Proposes Visual-ERM, reward modeling for visual equivalence that improves vision-to-code task performance. | reinforcement learning, multimodal |
| 24 | Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models | Proposes STEVO-Bench to evaluate the state-evolution capability of video world models. | world model |
| 25 | Thinking in Streaming Video | ThinkStream: a streaming video understanding framework based on an observe-think-express paradigm, addressing real-time constraints. | reinforcement learning, multimodal |
| 26 | SGMatch: Semantic-Guided Non-Rigid Shape Matching with Flow Regularization | SGMatch: semantic-guided non-rigid shape matching with flow regularization. | flow matching, foundation model |

🔬 Pillar 3: Perception & Semantics (8 papers)

| # | Title | One-line summary | Tags |
|---|---|---|---|
| 27 | LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction | Proposes LR-SGS, which reconstructs self-driving scenes via LiDAR-reflectance-guided salient Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 28 | Spectral Defense Against Resource-Targeting Attack in 3D Gaussian Splatting | Proposes a spectral defense mechanism against resource-exhaustion attacks in 3D Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 29 | Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass | Proposes CHROMM, which processes multi-view video in a single pass to achieve coherent human-scene reconstruction. | scene reconstruction, HMR, human motion |
| 30 | Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception | Proposes a relevance-driven scheduling framework for multimodal streaming perception, improving human-machine collaboration efficiency. | scene understanding, multimodal |
| 31 | Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis | Proposes SG-NLF, a pose-free LiDAR neural field method for high-quality view synthesis. | NeRF, neural radiance field, scene reconstruction |
| 32 | NOIR: Neural Operator mapping for Implicit Representations | NOIR: neural operator mapping for implicit representations, removing the dependence on discrete grids in medical imaging tasks. | implicit representation |
| 33 | VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors | VFM-Recon: cross-domain scene-level neural reconstruction using scale-aligned VFM priors. | VGGT, foundation model |
| 34 | Catalyst4D: High-Fidelity 3D-to-4D Scene Editing via Dynamic Propagation | Catalyst4D: high-fidelity 3D-to-4D scene editing via dynamic propagation. | 3DGS, NeRF |

🔬 Pillar 1: Robot Control (4 papers)

| # | Title | One-line summary | Tags |
|---|---|---|---|
| 35 | PVI: Plug-in Visual Injection for Vision-Language-Action Models | Proposes PVI, a plug-and-play visual injection module that improves VLA models' language-conditioned manipulation. | manipulation, bi-manual, flow matching |
| 36 | RoboStereo: Dual-Tower 4D Embodied World Models for Unified Policy Optimization | RoboStereo: a dual-tower 4D embodied world model for unified policy optimization, improving robot manipulation performance. | manipulation, policy learning, world model |
| 37 | SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation | SAW: builds a surgical action world model via controllable and scalable video generation. | sim-to-real, world model, affordance |
| 38 | Rethinking VLMs for Image Forgery Detection and Localization | Proposes IFDL-VLM, which leverages vision-language models to improve image forgery detection and localization. | manipulation |

🔬 Pillar 6: Video Extraction (4 papers)

| # | Title | One-line summary | Tags |
|---|---|---|---|
| 39 | Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence | VAEX-BENCH: a synthetic video benchmark for evaluating MLLMs' spatiotemporal abstract reasoning. | egocentric, spatiotemporal, large language model |
| 40 | Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering | Proposes the EgoPointVQA dataset to address gesture-based egocentric video question answering. | egocentric, large language model, multimodal |
| 41 | Rooftop Wind Field Reconstruction Using Sparse Sensors: From Deterministic to Generative Learning Methods | Proposes deep-learning-based rooftop wind field reconstruction from sparse sensor data, improving drone safety. | sparse sensors |
| 42 | CM-Bench: A Comprehensive Cross-Modal Feature Matching Benchmark Bridging Visible and Infrared Images | Builds CM-Bench, a cross-modal feature matching benchmark for infrared and visible images, advancing cross-modal vision applications. | feature matching |

🔬 Pillar 4: Generative Motion (2 papers)

| # | Title | One-line summary | Tags |
|---|---|---|---|
| 43 | InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing | InterEdit: a text-guided multi-human 3D motion editing framework, with an accompanying dataset. | text-to-motion, human motion |
| 44 | TRACE: Structure-Aware Character Encoding for Robust and Generalizable Document Watermarking | TRACE: a structure-aware character encoding framework for document watermarking, improving robustness and generalization. | MDM |

🔬 Pillar 7: Motion Retargeting (2 papers)

| # | Title | One-line summary | Tags |
|---|---|---|---|
| 45 | SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification | SDF-Net: a structure-aware disentangled feature learning network for optical-SAR ship re-identification. | geometric consistency |
| 46 | Spatial Reasoning is Not a Free Lunch: A Controlled Study on LLaVA | A controlled diagnostic study targeting LLaVA's weaknesses in spatial reasoning. | spatial relationship |
