cs.CV (2026-03-25)

📊 43 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (13 🔗1)
Pillar 2: RL & Architecture (12 🔗3)
Pillar 3: Perception & Semantics (6 🔗2)
Pillar 1: Robot Control (5 🔗2)
Pillar 6: Video Extraction (2 🔗1)
Pillar 8: Physics-based Animation (2 🔗1)
Pillar 7: Motion Retargeting (2 🔗1)
Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (13 papers)

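# | Title | One-sentence summary | Tags | 🔗
1 | VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection | VERIA: proposes a verification-centric multimodal instance augmentation method for long-tailed 3D object detection. | foundation model, multimodal
2 | AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer's Disease Diagnosis | AD-Reasoning: proposes a multimodal guided-reasoning framework for Alzheimer's disease diagnosis. | multimodal
3 | A^3: Towards Advertising Aesthetic Assessment | Proposes the A^3 framework to address the strong subjectivity and the lack of scalability and standards in advertising aesthetic assessment. | large language model, multimodal, chain-of-thought
4 | LensWalk: Agentic Video Understanding by Planning How You See in Videos | Proposes LensWalk to address the disconnect between perception and reasoning in video understanding. | large language model, chain-of-thought
5 | RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution | Proposes RefReward-SR, a low-resolution-conditioned reward model for preference-aligned super-resolution reconstruction. | large language model, multimodal
6 | When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm | Improved semantic understanding in multimodal large language models brings authenticity and safety risks. | large language model, multimodal
7 | VOLMO: Versatile and Open Large Models for Ophthalmology | VOLMO: a versatile, open large-model framework for ophthalmology. | large language model, multimodal
8 | Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training | Proposes a synergistic data-training framework to tackle document parsing in real-world scenes. | large language model, multimodal
9 | POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan | POLY-SIM challenge: multimodal speaker identification under missing-modality and cross-lingual conditions. | multimodal
10 | Counting Without Numbers & Finding Without Words | Proposes a multimodal pet reunification system that fuses visual and auditory biometrics, addressing the limitation of traditional methods that rely only on visual appearance. | multimodal
11 | OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning | OmniWeaving: proposes a unified video generation model that supports free-form composition and reasoning. | multimodal
12 | Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep | Proposes the HetCache framework to accelerate diffusion-based video editing, significantly reducing computational redundancy. | foundation model
13 | SilLang: Improving Gait Recognition with Silhouette Language Encoding | Proposes SilLang, which uses silhouette language encoding to improve gait recognition performance. | large language model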

🔬 Pillar 2: RL & Architecture (12 papers)

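# | Title | One-sentence summary | Tags | 🔗
14 | Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation | Proposes TSHaMo, a teacher-student diffusion model for text-driven 3D hand motion generation. | teacher-student, motion generation, MANO
15 | Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving | Latent-WAM: an end-to-end autonomous driving framework based on latent world action modeling. | world model, world models, world action model
16 | Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens | Le MuMo JEPA: multimodal self-supervised representation learning with learnable fusion tokens. | JEPA, representation learning, multimodal
17 | Toward Physically Consistent Driving Video World Models under Challenging Trajectories | Proposes PhyGenesis to address the physical inconsistency of autonomous-driving world models under abnormal trajectories. | world model, world models, physically plausible
18 | RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation | Proposes RS-SSM, which refines forgotten specifics to improve state space model performance in video semantic segmentation. | SSM, state space model, spatiotemporal
19 | CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning | CAKE: real-time action detection based on motion knowledge distillation and background-aware contrastive learning. | contrastive learning, distillation, optical flow
20 | PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning | PointRFT: an explicit reinforcement fine-tuning method for point cloud few-shot learning. | reinforcement learning, representation learning, reward design
21 | DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning | DecepGPT: proposes a schema-driven, multicultural, multimodal deception detection method that improves robustness and interpretability. | distillation, multimodal
22 | CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition | CliPPER: contextual video-language pretraining for event recognition in long-form intraoperative surgical videos. | contrastive learning, foundation model, multimodal
23 | Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement | Proposes text-guided multi-view knowledge distillation to improve the quality of the visual teacher's knowledge. | distillation
24 | SEGAR: Selective Enhancement for Generative Augmented Reality | SEGAR: a selective enhancement framework for generative augmented reality. | world model, world models
25 | Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions | Proposes a heuristic self-paced learning framework to address class bias in domain-adaptive semantic segmentation under adverse conditions. | reinforcement learning, curriculum learning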

🔬 Pillar 3: Perception & Semantics (6 papers)

# | Title | One-sentence summary | Tags | 🔗
26 | LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds | LightSplat: a fast, memory-efficient framework for open-vocabulary 3D scene understanding. | scene understanding, open-vocabulary, open vocabulary
27 | Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection | Proposes a decompose-and-transfer framework with CoT-prompting-enhanced alignment for open-vocabulary temporal action detection. | open-vocabulary, open vocabulary, large language model
28 | FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting | FilterGS: traversal-free parallel filtering and adaptive shrinking for large-scale LoD 3D Gaussian splatting. | 3D gaussian splatting, gaussian splatting, splatting
29 | COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm | COVTrack++: learns open-vocabulary multi-object tracking from continuous videos via a synergistic paradigm. | open-vocabulary, open vocabulary, spatial relationship
30 | SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision | SpectralSplats: robust, differentiable 3D Gaussian splatting tracking via spectral moment supervision. | 3D gaussian splatting, 3DGS, gaussian splatting
31 | EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction | EndoVGGT: GNN-enhanced depth estimation for surgical 3D reconstruction. | depth estimation

🔬 Pillar 1: Robot Control (5 papers)

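# | Title | One-sentence summary | Tags | 🔗
32 | TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models | Proposes TAG, which uses target-agnostic guidance to improve the target localization stability of VLA models in complex scenes. | manipulation, classifier-free guidance, vision-language-action
33 | Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection | Proposes the TSRL framework, which dynamically optimizes the training curriculum for deepfake detection and improves model generalization. | manipulation, reinforcement learning, PPO
34 | LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation | Proposes LGTM, a training-free, light-guided text-to-image diffusion model realized via initial noise manipulation. | manipulation
35 | Latent Bias Alignment for High-Fidelity Diffusion Inversion in Real-World Image Reconstruction and Manipulation | Proposes a latent bias alignment method to improve the fidelity of diffusion models in real-image reconstruction and editing. | manipulation
36 | Towards Training-Free Scene Text Editing | Proposes TextFlow, a training-free scene text editing framework that achieves high-fidelity text modification. | manipulation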

🔬 Pillar 6: Video Extraction (2 papers)

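# | Title | One-sentence summary | Tags | 🔗
37 | Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models | Proposes VisionToM to enhance the theory-of-mind capability of multimodal large language models. | egocentric, large language model, multimodal
38 | HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images | Proposes HGGT, which robustly and flexibly reconstructs 3D hand meshes from uncalibrated images. | hand reconstruction, foundation model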

🔬 Pillar 8: Physics-based Animation (2 papers)

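# | Title | One-sentence summary | Tags | 🔗
39 | Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic | Proposes an EEG-conditioned spatiotemporal neural frame modeling method for reconstructing high-resolution dynamic brain fMRI. | spatiotemporal, multimodal
40 | Uncertainty-Aware Vision-based Risk Object Identification via Conformal Risk Tube Prediction | Proposes an uncertainty-aware, vision-based risk object identification method built on conformal risk tube prediction. | spatiotemporal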

🔬 Pillar 7: Motion Retargeting (2 papers)

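# | Title | One-sentence summary | Tags | 🔗
41 | B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition | Proposes the B-MoE model, which tackles micro-action recognition through a body-part-aware mixture-of-experts approach. | human motion
42 | LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation | LaDy: a skeleton-based action segmentation network informed by Lagrangian dynamics, improving performance via spatial-temporal modulation. | human motion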

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-sentence summary | Tags | 🔗
43 | ViHOI: Human-Object Interaction Synthesis with Visual Priors | ViHOI: synthesizes realistic human-object interactions using visual priors. | motion generation, physically plausible, human-object interaction
