cs.CV (2026-02-21)

📊 26 papers in total | 🔗 8 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (13 🔗4) · Pillar 3: Spatial Perception & Semantics (5) · Pillar 2: RL Algorithms & Architecture (3 🔗1) · Pillar 4: Generative Motion (3 🔗1) · Pillar 1: Robot Control (1 🔗1) · Pillar 6: Video Extraction & Matching (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

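# | Title | One-sentence summary | Tags | 🔗
1 | FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model | Proposes FOCA, a frequency-oriented multimodal large language model for cross-domain forgery detection, localization, and explanation. | large language model, multimodal
2 | SCHEMA for Gemini 3 Pro Image: A Structured Methodology for Controlled AI Image Generation on Google's Native Multimodal Model | SCHEMA: a structured methodology for controlled AI image generation, designed for Gemini 3 Pro Image. | multimodal
3 | A high-resolution nationwide urban village mapping product for 342 Chinese cities based on foundation models | Proposes GeoLink-UV, a high-resolution urban village mapping product covering 342 Chinese cities, built on foundation models. | foundation model
4 | Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement | Proposes the EditedID framework to address facial ID consistency in multimodal editing. | multimodal
5 | Benchmarking Computational Pathology Foundation Models For Semantic Segmentation | Proposes a segmentation benchmark for computational pathology that evaluates and ensembles multiple foundation models to improve semantic segmentation of histopathology images. | foundation model
6 | MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions | Proposes the MIRROR framework, which performs multimodal iterative reasoning by reflecting on visual regions, improving the correctness of vision-language models and reducing hallucinations. | multimodal
7 | Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code | Proposes the GeoCode dataset, which achieves visual alignment through code prediction and improves multimodal geometric reasoning. | multimodal
8 | Echoes of Ownership: Adversarial-Guided Dual Injection for Copyright Protection in MLLMs | Proposes an adversarial-guided dual injection method for copyright protection of multimodal large language models. | large language model, multimodal
9 | HIME: Mitigating Object Hallucinations in LVLMs via Hallucination Insensitivity Model Editing | Proposes HIME, which mitigates object hallucinations in LVLMs through hallucination insensitivity model editing. | large language model, multimodal
10 | LaS-Comp: Zero-shot 3D Completion with Latent-Spatial Consistency | LaS-Comp: a zero-shot 3D completion method that exploits latent-spatial consistency. | foundation model
11 | Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding | Proposes Frame2Freq, which improves fine-grained video understanding through spectral adapters. | foundation model
12 | TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking | TIACam proposes a text-anchored invariant feature learning framework that makes zero-watermarking more robust to camera capture. | multimodal
13 | Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding | Proposes Video-TwG, which improves long video understanding through curriculum reinforced reasoning and video grounding. | multimodal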

🔬 Pillar 3: Spatial Perception & Semantics (5 papers)

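# | Title | One-sentence summary | Tags | 🔗
14 | Open-Vocabulary Domain Generalization in Urban-Scene Segmentation | Proposes the S2-Corr mechanism to address open-vocabulary domain generalization in urban-scene segmentation. | open-vocabulary, open vocabulary
15 | IRIS-SLAM: Unified Geo-Instance Representations for Robust Semantic Localization and Mapping | IRIS-SLAM: robust semantic localization and mapping using unified geometric-instance representations. | semantic mapping, semantic map, open-vocabulary
16 | Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates | Proposes Marginalized Bundle Adjustment, which uses monocular depth estimates for robust multi-view camera pose estimation. | depth estimation, monocular depth
17 | PhysConvex: Physics-Informed 3D Dynamic Convex Radiance Fields for Reconstruction and Simulation | PhysConvex: physics-informed dynamic convex radiance fields for 3D reconstruction and simulation. | 3DGS, NeRF
18 | Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection | Proposes LMP, which learns multi-modal prototypes for cross-domain few-shot object detection. | open-vocabulary, open vocabulary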

🔬 Pillar 2: RL Algorithms & Architecture (3 papers)

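# | Title | One-sentence summary | Tags | 🔗
19 | Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance | Proposes a preference alignment method for diffusion models based on Classifier-Free Guidance that improves image generation quality without retraining. | preference learning, DPO, direct preference optimization
20 | TAG: Thinking with Action Unit Grounding for Facial Expression Recognition | Proposes the TAG framework, which guides vision-language models with facial action units (AUs) for facial expression recognition, improving reasoning reliability. | reinforcement learning, multimodal
21 | Neural Fields as World Models | Proposes a neural-field-based world model that preserves spatial structure to enable physical prediction and policy learning. | world model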

🔬 Pillar 4: Generative Motion (3 papers)

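# | Title | One-sentence summary | Tags | 🔗
22 | BiMotion: B-spline Motion for Text-guided Dynamic 3D Character Generation | BiMotion: a text-guided dynamic 3D character generation method based on a B-spline motion representation. | motion generation
23 | Beyond Stationarity: Rethinking Codebook Collapse in Vector Quantization | Addresses codebook collapse in vector quantization by proposing non-stationary vector quantization and Transformer vector quantization methods. | VQ-VAE
24 | Spatial-Temporal State Propagation Autoregressive Model for 4D Object Generation | Proposes the 4DSTAR model, which generates spatio-temporally consistent 4D objects through autoregressive spatial-temporal state propagation. | VQ-VAE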

🔬 Pillar 1: Robot Control (1 paper)

# | Title | One-sentence summary | Tags | 🔗
25 | HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation | HeRO: a hierarchical 3D semantic representation for pose-aware object manipulation. | manipulation, imitation learning

🔬 Pillar 6: Video Extraction & Matching (1 paper)

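# | Title | One-sentence summary | Tags | 🔗
26 | Global Commander and Local Operative: A Dual-Agent Framework for Scene Navigation | Proposes DACo, a dual-agent framework that decouples global planning from local execution to improve scene navigation performance. | egocentric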
