cs.CV (2026-03-11)

📊 32 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗3) · Pillar 2: RL Algorithms & Architecture (7 🔗2) · Pillar 3: Spatial Perception & Semantics (6 🔗2) · Pillar 1: Robot Control (3 🔗1) · Pillar 8: Physics-based Animation (1) · Pillar 4: Generative Motion (1 🔗1) · Pillar 6: Video Extraction & Matching (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

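# | Title | Key point | Tags | 🔗
1 | Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models | Proposes Fuel Gauge, which predicts the CoT length of large models ahead of time to optimize resource allocation. | multimodal, chain-of-thought
2 | GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning | GeoSense: strengthens multimodal reasoning through geometric necessity perception. | large language model, multimodal
3 | Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI | Proposes Med-DualLoRA to address the adaptation of foundation models to 3D cardiac MRI. | foundation model
4 | Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding | Proposes inter-modal distance-invariant position encoding (DIPE) to mitigate the decay of visual information in long-text MLLM scenarios. | large language model, multimodal, visual grounding
5 | UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations | UniCom: unified multimodal modeling via compressed continuous semantic representations. | multimodal
6 | RandMark: On Random Watermarking of Visual Foundation Models | RandMark: proposes a random-watermarking method for verifying ownership of visual foundation models. | foundation model
7 | Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation | Evaluates how sensitive promptable foundation models are to human prompts in musculoskeletal CT segmentation. | foundation model
8 | GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations | GroundCount: augments vision-language models with object detection to mitigate counting hallucinations. | symbolic grounding
9 | Taking Shortcuts for Categorical VQA Using Super Neurons | Uses super neurons to speed up categorical visual question answering. | large language model
10 | How To Embed Matters: Evaluation of EO Embedding Design Choices | Systematically evaluates Earth observation embedding design choices to improve the performance and scalability of GeoFMs on remote sensing tasks. | foundation model
11 | Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues | Proposes a vision-language-model-based framework for cognitive defect analysis in infrared thermography, achieving zero-shot defect detection without training data. | multimodal
12 | Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression | Proposes CIPHER, which suppresses LVLM hallucinations via diffusion-guided adversarial perturbations. | multimodal
13 | Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning | Proposes the GeoAoT framework, which improves the global image geolocation ability of LMMs through actionable reasoning. | multimodal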

🔬 Pillar 2: RL Algorithms & Architecture (7 papers)

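# | Title | Key point | Tags | 🔗
14 | Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting | Splat2Real: novel-view scaling for physical AI using 3D Gaussian splatting. | imitation learning, monocular depth, metric depth
15 | SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning | SignSparK: efficient multilingual sign language production via sparse keyframe learning. | flow matching, 3D gaussian splatting, gaussian splatting
16 | Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment | Proposes a lifelong imitation learning framework with multimodal latent replay and incremental adjustment to improve continual policy optimization. | imitation learning, multimodal
17 | Pointy - A Lightweight Transformer for Point Cloud Foundation Models | Proposes Pointy, a lightweight Transformer for point cloud foundation models that achieves excellent performance on small datasets. | representation learning, foundation model
18 | World2Act: Latent Action Post-Training via Skill-Compositional World Models | Proposes World2Act, which post-trains via skill-compositional world models to improve the generalization of embodied agents. | world model, vision-language-action, VLA
19 | Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning | SLiM: efficient skeleton representation learning via decoder-free masked modeling. | representation learning, MAE, contrastive learning
20 | Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition | Proposes a video vision transformer coupled with contrastive-learning-based video quality assessment for video recognition, improving classification accuracy on low-quality videos. | contrastive learning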

🔬 Pillar 3: Spatial Perception & Semantics (6 papers)

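# | Title | Key point | Tags | 🔗
21 | PolGS++: Physically-Guided Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction | Proposes PolGS++, which achieves fast reflective surface reconstruction via physically guided polarimetric Gaussian splatting. | 3D gaussian splatting, 3DGS, gaussian splatting
22 | P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video | Proposes P-GSVC, a layered progressive 2D Gaussian splatting framework for scalable Gaussian representations of images and video. | gaussian splatting, splatting
23 | S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs | S2D: sparse-to-dense lifting for high-quality 3D reconstruction from minimal inputs. | 3D gaussian splatting, 3DGS, gaussian splatting
24 | UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark | Proposes a cross-spectral guided traffic cognition network for UAV traffic scene understanding. | scene understanding
25 | UniStitch: Unifying Semantic and Geometric Features for Image Stitching | UniStitch: an image stitching framework that unifies semantic and geometric features. | semantic map, multimodal
26 | WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation | WalkGPT: a vision-language conversation model with depth-aware segmentation for pedestrian navigation. | depth estimation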

🔬 Pillar 1: Robot Control (3 papers)

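# | Title | Key point | Tags | 🔗
27 | Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation | Proposes Concept-Gated Visual Distillation (CGVD) to improve the manipulation accuracy of VLA models in complex environments. | manipulation, distillation, vision-language-action
28 | One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination | Proposes a unified framework based on vision token manipulation to combat hallucination in multimodal large language models. | manipulation
29 | Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection | Proposes the Latent Transition Discrepancy (LTD) method to improve the generalization of synthetic image detection. | manipulation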

🔬 Pillar 8: Physics-based Animation (1 paper)

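# | Title | Key point | Tags | 🔗
30 | Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising | Proposes the Frames2Residual framework, which decouples spatiotemporal information to improve self-supervised video denoising. | spatiotemporal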

🔬 Pillar 4: Generative Motion (1 paper)

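# | Title | Key point | Tags | 🔗
31 | Geometric Autoencoder for Diffusion Models | Proposes the Geometric Autoencoder (GAE) to improve the image generation quality and efficiency of diffusion models. | classifier-free guidance, foundation model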

🔬 Pillar 6: Video Extraction & Matching (1 paper)

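# | Title | Key point | Tags | 🔗
32 | COMIC: Agentic Sketch Comedy Generation | Proposes the COMIC framework, in which agents generate short comedy videos that rival professional quality. | HuMoR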
