cs.CV (2026-03-27)

📊 48 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗4) · Pillar 2: RL & Architecture (14 🔗2) · Pillar 3: Perception & Semantics (9 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

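🔬 Pillar 9: Embodied Foundation Models (16 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification	Proposes the Visual Re-Examination (VRE) framework to improve the visual reasoning of multimodal LLMs and reduce hallucinations.	large language model multimodal	✅
2	SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning	SALMUBench: a benchmark for sensitive association-level multimodal unlearning.	multimodal
3	Finding Distributed Object-Centric Properties in Self-Supervised Transformers	Proposes Object-DINO, which extracts distributed object-centric properties from self-supervised ViTs without training, improving object discovery and multimodal alignment.	large language model multimodal visual grounding
4	Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification	Proposes the MaLSF framework, which addresses multimodal media verification through mask-aware local semantic fusion.	multimodal
5	FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants	FairLLaVA: a fairness-aware parameter-efficient fine-tuning method for large vision-language models.	large language model multimodal instruction following	✅
6	MA-Bench: Towards Fine-grained Micro-Action Understanding	Proposes the MA-Bench benchmark for evaluating multimodal large language models on fine-grained micro-action understanding.	large language model multimodal
7	Label-Free Cross-Task LoRA Merging with Null-Space Compression	Proposes a label-free cross-task LoRA merging method based on null-space compression, addressing the merging of heterogeneous tasks.	large language model foundation model
8	TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life	TaxaAdapter: uses vision taxonomy models for fine-grained image generation over the Tree of Life.	large language model multimodal
9	SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis	SkinGPT-X: a self-evolving collaborative multi-agent system for transparent and trustworthy dermatological diagnosis.	large language model multimodal
10	Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives	Proposes an effective token pruning strategy to optimize how GUI visual agents process historical screenshots.	large language model multimodal
11	Make Geometry Matter for Spatial Reasoning	Proposes the GeoSR framework to strengthen the spatial reasoning of vision-language models in static and dynamic scenes.	foundation model	✅
12	Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow	Proposes the generative video codec GVC, enabling zero-shot video coding and improving compression efficiency.	foundation model
13	From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter	Proposes the HDpy-13 dataset and Plot-Adapter to improve graphical API recommendation from hand-drawn plots.	large language model
14	HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network	Proposes HINT, a dual-path compositional contextualized network that improves matching discrimination in composed image retrieval.	multimodal	✅
15	Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding	Proposes a diffusion-model-based GUI agent to improve target grounding and interaction in GUI environments.	multimodal
16	ComVi: Context-Aware Optimized Comment Display in Video Playback	ComVi: a context-aware system for optimized comment display during video playback that improves user immersion.	TAMP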

🔬 Pillar 2: RL & Architecture (14 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
17	Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR	Proposes trajectory-guided reinforcement learning to make more effective use of visual evidence in multimodal RLVR.	reinforcement learning large language model multimodal
18	MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection	Proposes the MuDD dataset and the GPD framework for non-contact multimodal deception detection.	representation learning distillation multimodal
19	GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation	Proposes GeoGuide to address geometry learning in open-vocabulary 3D semantic segmentation.	distillation open-vocabulary open vocabulary
20	Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning	Proposes a contextual consistency learning framework that improves the robustness of open-vocabulary object detection across scenes.	contrastive learning open-vocabulary open vocabulary	✅
21	Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning	Proposes SCORE, which dynamically compresses video tokens via reinforcement learning to make long-video understanding more efficient.	reinforcement learning large language model multimodal
22	Learnable Quantum Efficiency Filters for Urban Hyperspectral Segmentation	Proposes learnable quantum efficiency (LQE) filters for urban hyperspectral image segmentation, improving performance and interpretability.	SSM scene understanding HSI
23	FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation	Proposes FAST3DIS, an end-to-end anchored scene Transformer for 3D instance segmentation.	representation learning contrastive learning scene understanding
24	HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching	HolisticSemGes: holistic co-speech gesture generation based on contrastive flow matching.	flow matching
25	MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model	Proposes MPDiT, a multi-scale Transformer architecture for efficient flow matching and diffusion models that significantly reduces computational cost.	flow matching	✅
26	4DRaL: Bridging 4D Radar with LiDAR for Place Recognition using Knowledge Distillation	Proposes the 4DRaL framework, which uses knowledge distillation to improve the robustness of 4D radar in robot localization.	distillation
27	HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning	Proposes HAD to address knowledge retention in lifelong heterogeneous learning.	distillation
28	Efficient Few-Shot Learning for Edge AI via Knowledge Distillation on MobileViT	Proposes a knowledge-distillation-based few-shot learning method on MobileViT for edge AI, improving accuracy and reducing power consumption.	distillation
29	Learnable Instance Attention Filtering for Adaptive Detector Distillation	Proposes LIAF-KD, which achieves adaptive object detector distillation through learnable instance attention filtering.	distillation
30	VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation	VLAgeBench: evaluates large vision-language models on zero-shot facial age estimation.	MAE multimodal

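🔬 Pillar 3: Perception & Semantics (9 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
31	OVI-MAP: Open-Vocabulary Instance-Semantic Mapping	OVI-MAP: decouples instance reconstruction from semantic inference to build open-vocabulary instance-semantic maps.	semantic mapping semantic map open-vocabulary
32	R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting	Proposes the R-PGA framework, which generates robust physical adversarial camouflage via relightable 3D Gaussian Splatting, improving autonomous driving safety.	3D gaussian splatting 3DGS gaussian splatting
33	SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection	SDDF: a specificity-driven dynamic focusing method for open-vocabulary camouflaged object detection.	open-vocabulary open vocabulary multimodal
34	Drive-Through 3D Vehicle Exterior Reconstruction via Dynamic-Scene SfM and Distortion-Aware Gaussian Splatting	Proposes a 3D vehicle exterior reconstruction method for dynamic scenes, addressing reconstruction challenges in dealership environments.	3D gaussian splatting gaussian splatting splatting
35	The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding	Vision-language models show limitations in embodied scene understanding, particularly regarding affordances.	scene understanding affordance
36	Scene Grounding In the Wild	Proposes a scene grounding framework based on semantic alignment, addressing large-scale 3D scene reconstruction.	3D gaussian splatting gaussian splatting splatting
37	GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport	GLINT: models scene-scale transparency via Gaussian radiance transport.	3D gaussian splatting gaussian splatting splatting
38	Detailed Geometry and Appearance from Opportunistic Motion	Exploits object motion to reconstruct high-precision geometry and appearance from sparse views.	gaussian splatting splatting
39	Zero-Shot Depth from Defocus	Proposes the FOSSA network and the ZEDD benchmark for zero-shot depth-from-defocus estimation.	metric depth	✅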

🔬 Pillar 1: Robot Control (3 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
40	CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions	Proposes CREval, an automated, interpretable evaluation framework for creative image editing under complex instructions.	manipulation large language model multimodal
41	Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment	For autonomous UAV pruning, proposes DEFOM-Stereo variants for real-time branch distance estimation.	sim-to-real MAE foundation model
42	DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation	Proposes DRUM, a diffusion-based, raydrop-aware method for Sim2Real LiDAR semantic segmentation.	sim2real	✅

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
43	Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision	Proposes the EgoPoint-Ground dataset and the SV-CoT framework for egocentric visual grounding cued by hand pointing.	egocentric egocentric vision large language model
44	Meta-Learned Adaptive Optimization for Robust Human Mesh Recovery with Uncertainty-Aware Parameter Updates	Proposes a meta-learning-based adaptive optimization method that improves the robustness of human mesh recovery.	human mesh recovery

🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
45	Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs	Proposes FSAR-LLaVA, which uses multimodal semantic knowledge from MLLMs to enhance few-shot action recognition.	spatiotemporal large language model multimodal
46	DUGAE: Unified Geometry and Attribute Enhancement via Spatiotemporal Correlations for G-PCC Compressed Dynamic Point Clouds	DUGAE: uses spatiotemporal correlations to jointly enhance the geometry and attributes of G-PCC compressed dynamic point clouds.	spatiotemporal	✅

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
47	PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery	Proposes PAD-Hand, which uses a physics-aware diffusion model to recover more realistic hand motion.	physically plausible motion recovery

🔬 Pillar 7: Motion Retargeting (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
48	VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward	VGGRPO: achieves world-consistent video generation using a 4D latent reward.	geometric consistency foundation model
