cs.CV (2026-03-27)

📊 48 papers in total | 🔗 9 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (16 🔗4) · Pillar 2: RL & Architecture (14 🔗2) · Pillar 3: Perception & Semantics (9 🔗1) · Pillar 1: Robot Control (3 🔗1) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 4: Generative Motion (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (16 papers)

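| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification | Proposes the Visual Re-Examination (VRE) framework to improve visual reasoning in multimodal LLMs and reduce hallucination | large language model multimodal | |
| 2 | SALMUBench: A Benchmark for Sensitive Association-Level Multimodal Unlearning | SALMUBench: a benchmark for sensitive association-level unlearning in multimodal models | multimodal | |
| 3 | Finding Distributed Object-Centric Properties in Self-Supervised Transformers | Proposes Object-DINO, which extracts distributed object-centric properties from self-supervised ViTs without training, improving object discovery and multimodal alignment | large language model multimodal visual grounding | |
| 4 | Bridging Pixels and Words: Mask-Aware Local Semantic Fusion for Multimodal Media Verification | Proposes the MaLSF framework, which tackles multimodal media verification through mask-aware local semantic fusion | multimodal | |
| 5 | FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants | FairLLaVA: a fairness-aware parameter-efficient fine-tuning method for large vision-language models | large language model multimodal instruction following | |
| 6 | MA-Bench: Towards Fine-grained Micro-Action Understanding | Proposes the MA-Bench benchmark for evaluating multimodal large language models on fine-grained micro-action understanding | large language model multimodal | |
| 7 | Label-Free Cross-Task LoRA Merging with Null-Space Compression | Proposes a label-free cross-task LoRA merging method based on null-space compression to address the merging of heterogeneous tasks | large language model foundation model | |
| 8 | TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life | TaxaAdapter: uses vision taxonomy models for fine-grained image generation over the Tree of Life | large language model multimodal | |
| 9 | SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis | SkinGPT-X: a self-evolving collaborative multi-agent system for transparent and trustworthy dermatological diagnosis | large language model multimodal | |
| 10 | Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives | Proposes effective token pruning strategies to optimize how GUI visual agents process historical screenshots | large language model multimodal | |
| 11 | Make Geometry Matter for Spatial Reasoning | Proposes the GeoSR framework to strengthen the spatial reasoning of vision-language models in static and dynamic scenes | foundation model | |
| 12 | Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow | Proposes GVC, a generative video codec that performs zero-shot video coding and improves compression efficiency | foundation model | |
| 13 | From Pen to Pixel: Translating Hand-Drawn Plots into Graphical APIs via a Novel Benchmark and Efficient Adapter | Proposes the HDpy-13 dataset and Plot-Adapter to improve graphical API recommendation from hand-drawn plots | large language model | |
| 14 | HINT: Composed Image Retrieval with Dual-path Compositional Contextualized Network | Proposes HINT, a dual-path compositional contextualized network that improves match discrimination in composed image retrieval | multimodal | |
| 15 | Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding | Proposes a diffusion-model-based GUI agent to improve target grounding and interaction in GUI environments | multimodal | |
| 16 | ComVi: Context-Aware Optimized Comment Display in Video Playback | ComVi: a context-aware system for optimized comment display during video playback, improving user immersion | TAMP | |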

🔬 Pillar 2: RL & Architecture (14 papers)

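| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 17 | Beyond Where to Look: Trajectory-Guided Reinforcement Learning for Multimodal RLVR | Proposes trajectory-guided reinforcement learning to make more effective use of visual evidence in multimodal RLVR | reinforcement learning large language model multimodal | |
| 18 | MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection | Proposes the MuDD dataset and the GPD framework for non-contact multimodal deception detection | representation learning distillation multimodal | |
| 19 | GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation | Proposes GeoGuide to address geometry learning in open-vocabulary 3D semantic segmentation | distillation open-vocabulary open vocabulary | |
| 20 | Consistency Beyond Contrast: Enhancing Open-Vocabulary Object Detection Robustness via Contextual Consistency Learning | Proposes a contextual consistency learning framework that improves the robustness of open-vocabulary object detection across scenes | contrastive learning open-vocabulary open vocabulary | |
| 21 | Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning | Proposes SCORE, which dynamically compresses video tokens via reinforcement learning to make long-video understanding more efficient | reinforcement learning large language model multimodal | |
| 22 | Learnable Quantum Efficiency Filters for Urban Hyperspectral Segmentation | Proposes learnable quantum efficiency (LQE) filters for urban hyperspectral image segmentation, improving performance and interpretability | SSM scene understanding HSI | |
| 23 | FAST3DIS: Feed-forward Anchored Scene Transformer for 3D Instance Segmentation | Proposes FAST3DIS, an end-to-end anchored scene Transformer for 3D instance segmentation | representation learning contrastive learning scene understanding | |
| 24 | HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching | HolisticSemGes: holistic co-speech gesture generation based on contrastive flow matching | flow matching | |
| 25 | MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model | Proposes MPDiT, a multi-scale Transformer architecture for efficient flow matching and diffusion models that significantly reduces computational cost | flow matching | |
| 26 | 4DRaL: Bridging 4D Radar with LiDAR for Place Recognition using Knowledge Distillation | Proposes the 4DRaL framework, which uses knowledge distillation to make 4D radar more robust for robot localization | distillation | |
| 27 | HAD: Heterogeneity-Aware Distillation for Lifelong Heterogeneous Learning | Proposes the HAD method to address knowledge retention in lifelong heterogeneous learning | distillation | |
| 28 | Efficient Few-Shot Learning for Edge AI via Knowledge Distillation on MobileViT | Proposes a knowledge-distillation-based few-shot learning method on MobileViT for edge AI, improving accuracy and lowering power consumption | distillation | |
| 29 | Learnable Instance Attention Filtering for Adaptive Detector Distillation | Proposes LIAF-KD, which achieves adaptive object detector distillation through learnable instance attention filtering | distillation | |
| 30 | VLAgeBench: Benchmarking Large Vision-Language Models for Zero-Shot Human Age Estimation | VLAgeBench: evaluates large vision-language models on zero-shot facial age estimation | MAE multimodal | |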

🔬 Pillar 3: Perception & Semantics (9 papers)

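| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 31 | OVI-MAP: Open-Vocabulary Instance-Semantic Mapping | OVI-MAP: decouples instance reconstruction from semantic inference to build open-vocabulary instance-semantic maps | semantic mapping semantic map open-vocabulary | |
| 32 | R-PGA: Robust Physical Adversarial Camouflage Generation via Relightable 3D Gaussian Splatting | Proposes the R-PGA framework, which generates robust physical adversarial camouflage via relightable 3D Gaussian Splatting to improve autonomous driving safety | 3D gaussian splatting 3DGS gaussian splatting | |
| 33 | SDDF: Specificity-Driven Dynamic Focusing for Open-Vocabulary Camouflaged Object Detection | SDDF: a specificity-driven dynamic focusing method for open-vocabulary camouflaged object detection | open-vocabulary open vocabulary multimodal | |
| 34 | Drive-Through 3D Vehicle Exterior Reconstruction via Dynamic-Scene SfM and Distortion-Aware Gaussian Splatting | Proposes a 3D vehicle exterior reconstruction method for dynamic scenes, addressing reconstruction challenges in dealership environments | 3D gaussian splatting gaussian splatting splatting | |
| 35 | The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding | Vision-language models show limitations in embodied scene understanding, especially regarding affordances | scene understanding affordance | |
| 36 | Scene Grounding In the Wild | Proposes a scene grounding framework based on semantic alignment to address large-scale 3D scene reconstruction | 3D gaussian splatting gaussian splatting splatting | |
| 37 | GLINT: Modeling Scene-Scale Transparency via Gaussian Radiance Transport | GLINT: models scene-scale transparency via Gaussian radiance transport | 3D gaussian splatting gaussian splatting splatting | |
| 38 | Detailed Geometry and Appearance from Opportunistic Motion | Exploits object motion to reconstruct high-precision geometry and appearance from sparse views | gaussian splatting splatting | |
| 39 | Zero-Shot Depth from Defocus | Proposes the FOSSA network and the ZEDD benchmark for zero-shot depth-from-defocus estimation | metric depth | |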

🔬 Pillar 1: Robot Control (3 papers)

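| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 40 | CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions | Proposes CREval, an automated and interpretable evaluation framework for creative image editing under complex instructions | manipulation large language model multimodal | |
| 41 | Real-Time Branch-to-Tool Distance Estimation for Autonomous UAV Pruning: Benchmarking Five DEFOM-Stereo Variants from Simulation to Jetson Deployment | For autonomous UAV pruning, proposes DEFOM-Stereo variants that enable real-time branch distance estimation | sim-to-real MAE foundation model | |
| 42 | DRUM: Diffusion-based Raydrop-aware Unpaired Mapping for Sim2Real LiDAR Segmentation | Proposes DRUM, a diffusion-based, raydrop-aware Sim2Real method for LiDAR semantic segmentation | sim2real | |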

🔬 Pillar 6: Video Extraction (2 papers)

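| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 43 | Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision | Proposes the EgoPoint-Ground dataset and the SV-CoT framework for egocentric visual grounding that uses hand pointing as a cue | egocentric egocentric vision large language model | |
| 44 | Meta-Learned Adaptive Optimization for Robust Human Mesh Recovery with Uncertainty-Aware Parameter Updates | Proposes a meta-learning-based adaptive optimization method that improves the robustness of human mesh recovery | human mesh recovery | |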

🔬 Pillar 8: Physics-based Animation (2 papers)

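| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 45 | Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs | Proposes FSAR-LLaVA, which uses multimodal semantic knowledge from MLLMs to enhance few-shot action recognition | spatiotemporal large language model multimodal | |
| 46 | DUGAE: Unified Geometry and Attribute Enhancement via Spatiotemporal Correlations for G-PCC Compressed Dynamic Point Clouds | DUGAE: jointly enhances the geometry and attributes of G-PCC compressed dynamic point clouds using spatiotemporal correlations | spatiotemporal | |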

🔬 Pillar 4: Generative Motion (1 paper)

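| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 47 | PAD-Hand: Physics-Aware Diffusion for Hand Motion Recovery | Proposes PAD-Hand, which uses a physics-aware diffusion model to recover more realistic hand motion | physically plausible motion recovery | |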

🔬 Pillar 7: Motion Retargeting (1 paper)

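| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 48 | VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward | VGGRPO: achieves world-consistent video generation with a 4D latent reward | geometric consistency foundation model | |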
