cs.CV (2026-03-18)

📊 54 papers in total | 🔗 14 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (19 🔗6) · Pillar 2: RL & Architecture (14 🔗3) · Pillar 3: Perception & Semantics (10 🔗3) · Pillar 1: Robot Control (4) · Pillar 4: Generative Motion (2 🔗1) · Pillar 6: Video Extraction (2 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (19 papers)

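| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models | Reveals the trade-off between temporal gains and spatial costs of video fine-tuning in multimodal large language models | large language model, multimodal | |
| 2 | MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval | Proposes MCoT-MVS, which uses multimodal CoT reasoning for precise visual selection in composed image retrieval | large language model, multimodal, chain-of-thought | |
| 3 | Parameter-Efficient Modality-Balanced Symmetric Fusion for Multimodal Remote Sensing Semantic Segmentation | Proposes MoBaNet to address modality imbalance in multimodal remote sensing semantic segmentation | foundation model, multimodal | |
| 4 | EI: Early Intervention for Multimodal Imaging based Disease Recognition | Proposes the EI framework, which improves the accuracy of multimodal medical imaging disease recognition through early intervention and MoR adaptation | foundation model, multimodal | |
| 5 | Revisiting foundation models for cell instance segmentation | Evaluates and improves several SAM-based foundation models for cell instance segmentation | foundation model | |
| 6 | UniSAFE: A Comprehensive Benchmark for Safety Evaluation of Unified Multimodal Models | UniSAFE: a comprehensive benchmark for safety evaluation of unified multimodal models | multimodal | |
| 7 | Harnessing the Power of Foundation Models for Accurate Material Classification | Proposes a foundation-model-based material classification framework that addresses data scarcity and improves classification accuracy | foundation model | |
| 8 | A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition | Proposes QGN, a proposal-free query-guided network, to address the mismatch between detectors and entities in GMNER | multimodal | |
| 9 | Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation | Proposes the Concept-to-Pixel framework to address automation and robustness in medical image segmentation | large language model, multimodal | |
| 10 | FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions | FineViT: unlocks fine-grained perception through dense recaptions, improving vision encoder performance | large language model, multimodal | |
| 11 | LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis | Proposes the LED benchmark for evaluating structural reasoning in layout error detection for document analysis | large language model, multimodal | |
| 12 | From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs | Reveals the mechanism of image segmentation in MLLMs by analyzing the interactions among the vision encoder, adapter, and LLM layers | large language model, multimodal | |
| 13 | The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering | Proposes a training-free text embedding interpolation method for continuous control of text-conditioned image generation | large language model | |
| 14 | VideoAtlas: Navigating Long-Form Video in Logarithmic Compute | Proposes VideoAtlas for navigating and understanding long videos with logarithmic compute | multimodal | |
| 15 | Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning | Proposes the C-TRAIL dataset and a multi-agent legal reasoning framework to automatically determine traffic accident responsibility from dashcam video | multimodal | |
| 16 | Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations? | Proposes EditSpilloverProbe to evaluate image editing models' implicit understanding of world relations | instruction following | |
| 17 | Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients | Proposes a fine-grained post-training quantization method based on quantization-aware integrated gradients, improving the quantization performance of large vision-language models | multimodal | |
| 18 | Eye image segmentation using visual and concept prompts with Segment Anything Model 3 (SAM3) | Evaluates SAM3 on eye image segmentation and compares it with SAM2 | foundation model | |
| 19 | Omni-I2C: A Holistic Benchmark for High-Fidelity Image-to-Code Generation | Proposes the Omni-I2C benchmark for evaluating large models' ability to convert images into executable code | multimodal | |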

🔬 Pillar 2: RL & Architecture (14 papers)

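| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 20 | GigaWorld-Policy: An Efficient Action-Centered World-Action Model | GigaWorld-Policy: an efficient action-centered World-Action model that accelerates robot policy learning | policy learning, physically plausible, motion prediction | |
| 21 | M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking | Proposes M2P, which improves visual foundation models through Mask-to-Point weakly-supervised learning for dense point tracking | representation learning, foundation model | |
| 22 | DeepCORO-CLIP: A Multi-View Foundation Model for Comprehensive Coronary Angiography Video-Text Analysis and External Validation | DeepCORO-CLIP: a multi-view foundation model for coronary angiography video-text analysis | contrastive learning, foundation model | |
| 23 | Stereo World Model: Camera-Guided Stereo Video Generation | Proposes StereoWorld, a camera-guided stereo world model for end-to-end stereo video generation | policy learning, world model, distillation | |
| 24 | FINER: MLLMs Hallucinate under Fine-grained Negative Queries | Proposes FINER to address hallucination in multimodal large language models | DPO, direct preference optimization, large language model | |
| 25 | Recurrent Reasoning with Vision-Language Models for Estimating Long-Horizon Embodied Task Progress | Proposes R²VLM, which uses recurrent reasoning with vision-language models to estimate long-horizon embodied task progress | reinforcement learning, policy learning, Ego4D | |
| 26 | Universal Skeleton Understanding via Differentiable Rendering and MLLMs | SkeletonLLM: universal skeleton understanding via differentiable rendering and MLLMs | distillation, large language model, multimodal | |
| 27 | EvoGuard: An Extensible Agentic RL-based Framework for Practical and Evolving AI-Generated Image Detection | EvoGuard: an extensible agentic-RL-based framework for detecting continually evolving AI-generated images | reinforcement learning, large language model, multimodal | |
| 28 | Adaptive Anchor Policies for Efficient 4D Gaussian Streaming | Proposes efficient anchor policies for 4D Gaussian streaming | reinforcement learning, gaussian splatting, splatting | |
| 29 | Mutually Causal Semantic Distillation Network for Zero-Shot Learning | Proposes MSDN++, a mutually causal semantic distillation network, to improve semantic knowledge transfer in zero-shot learning | distillation, mutual attention | |
| 30 | AR-CoPO: Align Autoregressive Video Generation with Contrastive Policy Optimization | Proposes AR-CoPO, which aligns autoregressive video generation with human feedback through contrastive policy optimization | reinforcement learning, flow matching, RLHF | |
| 31 | AdapTS: Lightweight Teacher-Student Approach for Multi-Class and Continual Visual Anomaly Detection | AdapTS: a lightweight teacher-student framework for multi-class and continual visual anomaly detection | teacher-student | |
| 32 | Towards Motion-aware Referring Image Segmentation | Proposes a motion-aware referring image segmentation method to address the performance bottleneck of existing methods on motion-related queries | contrastive learning, multimodal | |
| 33 | Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation | Proposes a fine-tuning method based on GSA and DPO for controlling coherence and style in story generation | DPO, direct preference optimization | |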

🔬 Pillar 3: Perception & Semantics (10 papers)

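| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 34 | Part-Aware Open-Vocabulary 3D Affordance Grounding via Prototypical Semantic and Geometric Alignment | Proposes prototypical semantic and geometric alignment for open-vocabulary 3D affordance grounding | open-vocabulary, open vocabulary, affordance | |
| 35 | MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing | Proposes MM-OVSeg for multimodal open-vocabulary segmentation of remote sensing imagery under adverse weather | open-vocabulary, open vocabulary, foundation model | |
| 36 | ReLaGS: Relational Language Gaussian Splatting | Proposes the ReLaGS framework for unified 3D perception and reasoning | gaussian splatting, splatting, open-vocabulary | |
| 37 | Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding | Motion-MLLM: enhances multimodal large models with motion information for efficient and accurate 3D scene understanding | scene understanding, spatial relationship, large language model | |
| 38 | S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models | Proposes S-VGGT, which improves the scalability of 3D foundation models through structure-aware subscene decomposition | VGGT, foundation model | |
| 39 | AHOY! Animatable Humans under Occlusion from YouTube Videos with Gaussian Splatting and Video Diffusion Priors | AHOY: reconstructs animatable humans under occlusion from YouTube videos using Gaussian splatting and video diffusion priors | 3DGS, gaussian splatting, splatting | |
| 40 | UniSem: Generalizable Semantic 3D Reconstruction from Sparse Unposed Images | UniSem: generalizable semantic 3D reconstruction from sparse unposed images | depth estimation, 3D gaussian splatting, 3DGS | |
| 41 | Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing | Edit-As-Act: goal-regressive planning for open-vocabulary 3D indoor scene editing | open-vocabulary, open vocabulary | |
| 42 | PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation | Proposes PCA-Seg, a parallel cost aggregation method, to address knowledge interference in open-vocabulary semantic and part segmentation | open-vocabulary, open vocabulary | |
| 43 | CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image | CrowdGaussian: proposes a method for reconstructing high-fidelity 3D Gaussians of human crowds from a single image | 3D gaussian splatting, 3DGS, gaussian splatting | |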

🔬 Pillar 1: Robot Control (4 papers)

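| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 44 | GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes | Proposes the GMT framework for 6-DOF object trajectory synthesis in 3D scenes | manipulation, scene understanding, human-object interaction | |
| 45 | Noise-Aware Misclassification Attack Detection in Collaborative DNN Inference | Proposes a noise-aware VAE framework for detecting malicious data injection attacks in collaborative DNN inference | manipulation | |
| 46 | Omni-3DEdit: Generalized Versatile 3D Editing in One-Pass | Omni-3DEdit: proposes a general, one-pass 3D editing framework that addresses the inefficiency and task dependence of traditional methods | manipulation | |
| 47 | Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing | Proposes Rel-Zero, which exploits the invariance of patch-pair relations for zero-watermarking that is robust to AI editing | manipulation | |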

🔬 Pillar 4: Generative Motion (2 papers)

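| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 48 | Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis | Proposes a diffusion-based, phonology-guided sign language motion generation method that significantly improves generation quality | MDM, motion generation, SMPL | |
| 49 | OnlineHMR: Video-based Online World-Grounded Human Mesh Recovery | Proposes OnlineHMR for online, world-grounded human mesh recovery from video | physically plausible, human mesh recovery, HMR | |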

🔬 Pillar 6: Video Extraction (2 papers)

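| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 50 | Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models | Proposes Loc3R-VLM for language-based 3D localization and reasoning | egocentric, geometric consistency, large language model | |
| 51 | Gesture-Aware Pretraining and Token Fusion for 3D Hand Pose Estimation | Proposes gesture-aware pretraining and token fusion to improve the accuracy of 3D hand pose estimation from monocular images | MANO | |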

🔬 Pillar 7: Motion Retargeting (2 papers)

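| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 52 | PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation | PC-CrossDiff: a point-cluster dual-level cross-modal differential attention mechanism for unified 3D referring and segmentation | spatial relationship, visual grounding | |
| 53 | 3D MRI-Based Alzheimer's Disease Classification Using Multi-Modal 3D CNN with Leakage-Aware Subject-Level Evaluation | Proposes a multimodal 3D CNN method for MRI-based Alzheimer's disease classification that improves diagnostic accuracy | spatial relationship, multimodal | |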

🔬 Pillar 8: Physics-based Animation (1 paper)

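| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 54 | Video Understanding: From Geometry and Semantics to Unified Models | A survey of video understanding, from geometry and semantics to unified models, exploring spatiotemporal reasoning and dynamic visual context modeling | spatiotemporal, foundation model | |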
