cs.CV (2026-03-11)

📊 32 papers in total | 🔗 9 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (13 🔗3) | Pillar 2: RL & Architecture (7 🔗2) | Pillar 3: Perception & Semantics (6 🔗2) | Pillar 1: Robot Control (3 🔗1) | Pillar 8: Physics-based Animation (1) | Pillar 4: Generative Motion (1 🔗1) | Pillar 6: Video Extraction (1)

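🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models	Proposes Fuel Gauge, which predicts the chain-of-thought length of large multimodal models ahead of time to optimize resource allocation.	multimodal chain-of-thought
2	GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning	GeoSense: strengthens multimodal reasoning through geometric necessity perception.	large language model multimodal
3	Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI	Proposes Med-DualLoRA to address the adaptation of foundation models to 3D cardiac MRI.	foundation model
4	Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding	Proposes inter-modal distance invariant position encoding (DIPE) to mitigate visual information decay in MLLMs under long-text settings.	large language model multimodal visual grounding	✅
5	UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations	UniCom: unified multimodal modeling via compressed continuous semantic representations.	multimodal
6	RandMark: On Random Watermarking of Visual Foundation Models	RandMark: proposes a random-watermarking method for verifying ownership of visual foundation models.	foundation model
7	Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation	Evaluates how sensitive promptable foundation models are to human-provided prompts in musculoskeletal CT segmentation.	foundation model	✅
8	GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations	GroundCount: augments vision-language models with object detection to mitigate counting hallucinations.	symbolic grounding
9	Taking Shortcuts for Categorical VQA Using Super Neurons	Uses super neurons to accelerate categorical visual question answering.	large language model
10	How To Embed Matters: Evaluation of EO Embedding Design Choices	Systematically evaluates Earth observation (EO) embedding design choices to improve the performance and scalability of GeoFMs on remote sensing tasks.	foundation model
11	Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues	Proposes a vision-language-model-based framework for cognitive defect analysis in infrared thermography, achieving zero-shot defect detection without training data.	multimodal
12	Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression	Proposes CIPHER, which suppresses LVLM hallucinations via diffusion-guided adversarial perturbations.	multimodal	✅
13	Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning	Proposes the GeoAoT framework, which improves the global image geolocation ability of LMMs through actionable reasoning.	multimodal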

🔬 Pillar 2: RL & Architecture (7 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting	Splat2Real: novel-view scaling for physical AI using 3D Gaussian splatting.	imitation learning monocular depth metric depth
15	SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning	SignSparK: efficient multilingual sign language production via sparse keyframe learning.	flow matching 3D gaussian splatting gaussian splatting
16	Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment	Proposes a lifelong imitation learning framework with multimodal latent replay and incremental adjustment, improving continual policy refinement.	imitation learning multimodal	✅
17	Pointy - A Lightweight Transformer for Point Cloud Foundation Models	Proposes Pointy, a lightweight Transformer for point cloud foundation models that achieves excellent performance on small datasets.	representation learning foundation model	✅
18	World2Act: Latent Action Post-Training via Skill-Compositional World Models	Proposes World2Act, which post-trains via skill-compositional world models to improve the generalization of embodied agents.	world model vision-language-action VLA
19	Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning	SLiM: efficient skeleton representation learning via decoder-free masked modeling.	representation learning MAE contrastive learning
20	Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition	Proposes a video vision Transformer coupled with contrastive-learning-based video quality assessment for video recognition, improving classification accuracy on low-quality videos.	contrastive learning

🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
21	PolGS++: Physically-Guided Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction	Proposes PolGS++, which achieves fast reflective surface reconstruction via physically guided polarimetric Gaussian splatting.	3D gaussian splatting 3DGS gaussian splatting
22	P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video	Proposes P-GSVC, a layered progressive 2D Gaussian splatting framework for scalable Gaussian representations of images and video.	gaussian splatting splatting	✅
23	S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs	S2D: sparse-to-dense lifting for high-quality 3D reconstruction from minimal inputs.	3D gaussian splatting 3DGS gaussian splatting
24	UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark	Proposes a cross-spectral guided traffic cognition network for UAV traffic scene understanding.	scene understanding	✅
25	UniStitch: Unifying Semantic and Geometric Features for Image Stitching	UniStitch: an image stitching framework that unifies semantic and geometric features.	semantic map multimodal
26	WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation	WalkGPT: a vision-language conversation model with depth-aware segmentation for pedestrian navigation.	depth estimation

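🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-line summary	Tags	🔗	⭐
27	Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation	Proposes concept-gated visual distillation (CGVD) to improve the manipulation accuracy of VLA models in complex environments.	manipulation distillation vision-language-action
28	One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination	Proposes a unified framework based on vision token manipulation to counter hallucination in multimodal large language models.	manipulation
29	Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection	Proposes the latent transition discrepancy (LTD) method to improve the generalization of synthetic image detection.	manipulation	✅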

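🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
30	Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising	Proposes the Frames2Residual framework, which decouples spatial and temporal information to improve self-supervised video denoising.	spatiotemporal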

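🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
31	Geometric Autoencoder for Diffusion Models	Proposes the geometric autoencoder (GAE) to improve the image generation quality and efficiency of diffusion models.	classifier-free guidance foundation model	✅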

🔬 Pillar 6: Video Extraction (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
32	COMIC: Agentic Sketch Comedy Generation	Proposes the COMIC framework, which uses agents to generate short comedy videos that rival professional quality.	HuMoR
