cs.CV (2026-03-17)

📊 54 papers in total | 🔗 17 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (19 🔗8) · Pillar 2: RL & Architecture (14 🔗3) · Pillar 3: Perception & Semantics (10 🔗3) · Pillar 1: Robot Control (5 🔗1) · Pillar 4: Generative Motion (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1 🔗1) · Pillar 6: Video Extraction (1)

🔬 Pillar 9: Embodied Foundation Models (19 papers)

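# | Title | One-line summary | Tags | 🔗
1 | GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models | Proposes GAP-MLLM, which uses geometry-aligned pre-training to improve 3D spatial perception in multimodal large language models | large language model, multimodal, visual grounding
2 | VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents | VisBrowse-Bench: a visual-native search benchmark for multimodal browsing agents | large language model, multimodal
3 | When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition | Proposes the FrameRepeat framework, which mitigates visual forgetting in video reasoning through frame repetition | large language model, multimodal, chain-of-thought
4 | KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety | KidsNanny: a two-stage multimodal content moderation pipeline for child safety | multimodal
5 | Fast-WAM: Do World Action Models Need Test-time Future Imagination? | Proposes Fast-WAM, which speeds up embodied control tasks without test-time future imagination | vision-language-action, VLA
6 | Kestrel: Grounding Self-Refinement for LVLM Hallucination Mitigation | Kestrel: an LVLM hallucination mitigation framework based on visual grounding and self-refinement | multimodal, visual grounding
7 | MLLM-based Textual Explanations for Face Comparison | Analyzes the reliability of MLLM-generated explanations in face comparison and reveals their hallucination problems | large language model, multimodal
8 | InViC: Intent-aware Visual Cues for Medical Visual Question Answering | Proposes the InViC framework, which improves image understanding in medical VQA through intent-aware visual cues | large language model, multimodal
9 | 360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method | Proposes Free360, a training-free 360° image VQA framework that improves MLLM understanding of panoramic images | large language model, multimodal
10 | What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers | Uses ALiBi positional encoding to reduce positional bias in Vision Transformers and improve zero-shot transfer | foundation model
11 | Retrieving Counterfactuals Improves Visual In-Context Learning | Proposes the CIRCLES framework, which improves visual in-context learning by retrieving counterfactual samples | multimodal
12 | World Reconstruction From Inconsistent Views | Proposes a non-rigid alignment method that reconstructs the 3D world from inconsistent video frames | foundation model
13 | BUSSARD: Normalizing Flows for Bijective Universal Scene-Specific Anomalous Relationship Detection | Proposes BUSSARD, which uses normalizing flows for bijective universal scene-specific anomalous relationship detection | multimodal
14 | VIEW2SPACE: Studying Multi-View Visual Reasoning from Sparse Observations | Proposes the VIEW2SPACE benchmark for studying multi-view visual reasoning from sparse views, along with a Grounded Chain-of-Thought method | chain-of-thought
15 | Cross-modal learning for plankton recognition | Proposes a plankton recognition method based on cross-modal self-supervised learning that uses image and optical measurement data to improve recognition accuracy | multimodal
16 | Persistent Story World Simulation with Continuous Character Customization | EverTale: a story world simulator with continuous character customization that addresses character consistency and scene integration | chain-of-thought
17 | Grounding the Score: Explicit Visual Premise Verification for Reliable Vision-Language Process Reward Models | Proposes EVPV, which improves the reliability of vision-language process reward models through explicit visual premise verification | multimodal
18 | Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training | Proposes IOMM: efficient pre-training for UMM visual generation via masked image modeling | multimodal
19 | Boosting Quantitive and Spatial Awareness for Zero-Shot Object Counting | Proposes the QICA framework, which improves quantity and spatial awareness in zero-shot object counting | zero-shot transfer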

🔬 Pillar 2: RL & Architecture (14 papers)

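# | Title | One-line summary | Tags | 🔗
20 | HGP-Mamba: Integrating Histology and Generated Protein Features for Mamba-based Multimodal Survival Risk Prediction | HGP-Mamba: fuses histology and generated protein features for Mamba-based multimodal survival risk prediction | Mamba, foundation model, multimodal
21 | Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation | Iris: brings real-world priors into a diffusion model for monocular depth estimation | distillation, depth estimation, monocular depth
22 | Learning Human-Object Interaction for 3D Human Pose Estimation from LiDAR Point Clouds | Proposes the HOIL framework, which uses human-object interaction learning to improve the accuracy of 3D human pose estimation from LiDAR point clouds | contrastive learning, contact-aware, human-object interaction
23 | Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval | Evo-Retriever: multimodal document retrieval based on LLM-guided curriculum evolution and viewpoint-pathway collaboration | contrastive learning, multimodal
24 | ViT-AdaLA: Adapting Vision Transformers with Linear Attention | ViT-AdaLA: adapts Vision Transformers with linear attention to address scalability to long sequences | linear attention, large language model, foundation model
25 | Deep Reinforcement Learning-driven Edge Offloading for Latency-constrained XR pipelines | Proposes a deep reinforcement learning-based edge offloading framework that optimizes latency-constrained XR applications | reinforcement learning, deep reinforcement learning
26 | SF-Mamba: Rethinking State Space Model for Vision | SF-Mamba: improves the efficiency and performance of Mamba models on vision tasks through auxiliary patch swapping and batch folding | Mamba, state space model
27 | Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning | Proposes the CTRL-S framework, which improves the reasoning reliability of SVG-LLMs through multi-task multi-reward reinforcement learning | reinforcement learning, chain-of-thought
28 | Fast-HaMeR: Boosting Hand Mesh Reconstruction using Knowledge Distillation | Fast-HaMeR: uses knowledge distillation to speed up hand mesh reconstruction for resource-constrained devices | distillation, hand reconstruction
29 | Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models | Proposes Proxy-GRM, which improves the performance of vision-language reward models by learning rubrics through proxy-guided critique | reinforcement learning, RLHF, multimodal
30 | RASLF: Representation-Aware State Space Model for Light Field Super-Resolution | RASLF: a representation-aware state space model for light field super-resolution | SSM, state space model
31 | Micro-AU CLIP: Fine-Grained Contrastive Learning from Local Independence to Global Dependency for Micro-Expression Action Unit Detection | Proposes the Micro-AU CLIP framework, which addresses the modeling of local independence and global dependency in micro-expression action unit detection | contrastive learning
32 | VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment | VIGOR: a temporal generative alignment method oriented toward video geometric consistency that improves video diffusion model quality | reinforcement learning, foundation model
33 | EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation | EFF-Grasp: physics-aware dexterous grasp generation based on energy-field flow matching | flow matching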

🔬 Pillar 3: Perception & Semantics (10 papers)

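# | Title | One-line summary | Tags | 🔗
34 | M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM | M^3: monocular Gaussian splatting SLAM that combines multi-view foundation models with dense matching | gaussian splatting, splatting, scene reconstruction
35 | Leveling3D: Leveling Up 3D Reconstruction with Feed-Forward 3D Gaussian Splatting and Geometry-Aware Generation | Leveling3D: combines feed-forward 3D Gaussian splatting with geometry-aware generation to improve 3D reconstruction quality | depth estimation, 3D gaussian splatting, 3DGS
36 | Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty | Proposes a Monte Carlo sampling-based 3DGS pose refinement method that improves robustness under pose prior and geometric uncertainty | 3D gaussian splatting, 3DGS, gaussian splatting
37 | MessyKitchens: Contact-rich object-level 3D scene reconstruction | Proposes the MessyKitchens dataset and designs the MOD network for contact-rich object-level 3D scene reconstruction | depth estimation, scene reconstruction, sam 3D
38 | WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation | WildDepth: a multimodal dataset for 3D wildlife perception and depth estimation | depth estimation, multimodal
39 | $D^3$-RSMDE: 40$\times$ Faster and High-Fidelity Remote Sensing Monocular Depth Estimation | Proposes the $D^3$-RSMDE framework, which runs 40$\times$ faster and improves the quality of remote sensing monocular depth estimation | depth estimation, monocular depth
40 | PureCLIP-Depth: Prompt-Free and Decoder-Free Monocular Depth Estimation within CLIP Embedding Space | PureCLIP-Depth: a fully prompt-free and decoder-free monocular depth estimation model in CLIP embedding space | depth estimation, monocular depth
41 | $x^2$-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space | Proposes $x^2$-Fusion, which unifies multimodal features in event edge space to improve the accuracy of optical flow and scene flow estimation in dynamic scenes | scene understanding, optical flow, scene flow
42 | VideoMatGen: PBR Materials through Joint Generative Modeling | VideoMatGen: a joint generation method for PBR materials based on a video diffusion Transformer | height map, physically plausible
43 | NanoGS: Training-Free Gaussian Splat Simplification | NanoGS: a training-free Gaussian splat simplification framework that reduces storage and transmission costs | 3DGS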

🔬 Pillar 1: Robot Control (5 papers)

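# | Title | One-line summary | Tags | 🔗
44 | ECHO: Edge-Cloud Humanoid Orchestration for Language-to-Motion Control | ECHO: an edge-cloud collaborative framework for language-driven whole-body control of humanoid robots | humanoid, humanoid robot, whole-body control
45 | Ground Reaction Inertial Poser: Physics-based Human Motion Capture from Sparse IMUs and Insole Pressure Sensors | GRIP: physically plausible human motion capture that combines sparse IMUs with pressure sensors | humanoid, physically plausible, human motion
46 | Segmentation-Based Attention Entropy: Detecting and Mitigating Object Hallucinations in Large Vision-Language Models | Proposes a Segmentation-Based Attention Entropy (SAE) method for detecting and mitigating object hallucinations in large vision-language models | quadruped, multimodal, visual grounding
47 | S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight | Proposes S-VAM to address real-time inference and high-fidelity foresight in video-action models | manipulation, distillation, foundation model
48 | Demystifing Video Reasoning | Reveals the reasoning mechanism in video generation models: the diffusion denoising process, not the frame sequence, is key | manipulation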

🔬 Pillar 4: Generative Motion (2 papers)

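# | Title | One-line summary | Tags | 🔗
49 | Interact3D: Compositional 3D Generation of Interactive Objects | Interact3D: compositional 3D generation of interactive objects that addresses occlusion and the preservation of spatial relationships | physically plausible, spatial relationship
50 | V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising | V-Co: takes a closer look at visual representation alignment via co-denoising and improves the performance of pixel-space diffusion models | classifier-free guidance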

🔬 Pillar 8: Physics-based Animation (2 papers)

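# | Title | One-line summary | Tags | 🔗
51 | Point-to-Mask: From Arbitrary Point Annotations to Mask-Level Infrared Small Target Detection | Proposes the Point-to-Mask framework, which achieves mask-level infrared small target detection from low-cost point annotations | spatiotemporal
52 | 3D tomography of exchange phase in a Si/SiGe quantum dot device | Proposes a 3D phase volume extraction method to address the exchange interaction problem in quantum dot devices | PULSE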

🔬 Pillar 7: Motion Retargeting (1 paper)

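# | Title | One-line summary | Tags | 🔗
53 | OneWorld: Taming Scene Generation with 3D Unified Representation Autoencoder | OneWorld: a 3D unified representation autoencoder that improves cross-view consistency in 3D scene generation | geometric consistency, foundation model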

🔬 Pillar 6: Video Extraction (1 paper)

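# | Title | One-line summary | Tags | 🔗
54 | SOMA: Unifying Parametric Human Body Models | SOMA: unifies parametric human body models, enabling cross-model data fusion and applications | SMPL, SMPL-X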
