cs.CV (2026-03-25)

📊 43 papers in total | 🔗 11 with code

🎯 Browse by interest area

Pillar 9: Embodied Foundation Models (13 🔗1) | Pillar 2: RL & Architecture (12 🔗3) | Pillar 3: Perception & Semantics (6 🔗2) | Pillar 1: Robot Control (5 🔗2) | Pillar 6: Video Extraction (2 🔗1) | Pillar 8: Physics-based Animation (2 🔗1) | Pillar 7: Motion Retargeting (2 🔗1) | Pillar 4: Generative Motion (1)

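🔬 Pillar 9: Embodied Foundation Models (13 papers)

#	Title	One-line summary	Tags	🔗	⭐
1	VERIA: Verification-Centric Multimodal Instance Augmentation for Long-Tailed 3D Object Detection	VERIA: a verification-centric multimodal instance augmentation method for long-tailed 3D object detection	foundation model multimodal
2	AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer's Disease Diagnosis	AD-Reasoning: a multimodal guideline-guided reasoning framework for Alzheimer's disease diagnosis	multimodal
3	A^3: Towards Advertising Aesthetic Assessment	Proposes the A^3 framework to address the strong subjectivity, limited scalability, and lack of standards in advertising aesthetic assessment	large language model multimodal chain-of-thought	✅
4	LensWalk: Agentic Video Understanding by Planning How You See in Videos	Proposes LensWalk to address the disconnect between perception and reasoning in video understanding	large language model chain-of-thought
5	RefReward-SR: LR-Conditioned Reward Modeling for Preference-Aligned Super-Resolution	Proposes RefReward-SR, a reward model conditioned on the low-resolution input, for preference-aligned super-resolution	large language model multimodal
6	When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm	Stronger semantic understanding in multimodal large language models brings authenticity and safety risks	large language model multimodal
7	VOLMO: Versatile and Open Large Models for Ophthalmology	VOLMO: a versatile, open large-model framework for ophthalmology	large language model multimodal
8	Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training	Proposes a synergistic data and training framework to tackle document parsing in real-world scenes	large language model multimodal
9	POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan	POLY-SIM challenge: multimodal speaker identification under missing-modality and cross-lingual conditions	multimodal
10	Counting Without Numbers & Finding Without Words	Proposes a multimodal pet reunification system that fuses visual and acoustic biometrics, overcoming the limitation of conventional methods that rely on visual appearance alone	multimodal
11	OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning	OmniWeaving: a unified video generation model that supports free-form composition and reasoning	multimodal
12	Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep	Proposes the HetCache framework to accelerate diffusion-based video editing and significantly reduce redundant computation	foundation model
13	SilLang: Improving Gait Recognition with Silhouette Language Encoding	Proposes SilLang, which uses silhouette language encoding to improve gait recognition performance	large language model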

🔬 Pillar 2: RL & Architecture (12 papers)

#	Title	One-line summary	Tags	🔗	⭐
14	Teacher-Student Diffusion Model for Text-Driven 3D Hand Motion Generation	Proposes TSHaMo, a teacher-student diffusion model for text-driven 3D hand motion generation	teacher-student motion generation MANO
15	Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving	Latent-WAM: an end-to-end autonomous driving framework based on latent world action modeling	world model world models world action model
16	Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens	Le MuMo JEPA: multimodal self-supervised representation learning with learnable fusion tokens	JEPA representation learning multimodal
17	Toward Physically Consistent Driving Video World Models under Challenging Trajectories	Proposes PhyGenesis to address the physical inconsistency of autonomous driving world models under abnormal trajectories	world model world models physically plausible	✅
18	RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation	Proposes RS-SSM, which refines forgotten specific information to improve state space models on video semantic segmentation	SSM state space model spatiotemporal	✅
19	CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning	CAKE: real-time action detection via motion knowledge distillation and background-aware contrastive learning	contrastive learning distillation optical flow
20	PointRFT: Explicit Reinforcement Fine-tuning for Point Cloud Few-shot Learning	PointRFT: an explicit reinforcement fine-tuning method for few-shot point cloud learning	reinforcement learning representation learning reward design
21	DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning	DecepGPT: a schema-driven, multicultural, multimodal deception detection method that improves robustness and interpretability	distillation multimodal
22	CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition	CliPPER: contextual video-language pretraining for event recognition in long-form intraoperative surgical videos	contrastive learning foundation model multimodal	✅
23	Powerful Teachers Matter: Text-Guided Multi-view Knowledge Distillation with Visual Prior Enhancement	Proposes text-guided multi-view knowledge distillation to improve the quality of the visual teacher's knowledge	distillation
24	SEGAR: Selective Enhancement for Generative Augmented Reality	SEGAR: a selective enhancement framework for generative augmented reality	world model world models
25	Heuristic Self-Paced Learning for Domain Adaptive Semantic Segmentation under Adverse Conditions	Proposes a heuristic self-paced learning framework to address class bias in domain-adaptive semantic segmentation under adverse conditions	reinforcement learning curriculum learning

🔬 Pillar 3: Perception & Semantics (6 papers)

#	Title	One-line summary	Tags	🔗	⭐
26	LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds	LightSplat: a fast, memory-efficient framework for open-vocabulary 3D scene understanding	scene understanding open-vocabulary open vocabulary	✅
27	Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection	Proposes a decompose-and-transfer framework with CoT-prompting-enhanced alignment for open-vocabulary temporal action detection	open-vocabulary open vocabulary large language model
28	FilterGS: Traversal-Free Parallel Filtering and Adaptive Shrinking for Large-Scale LoD 3D Gaussian Splatting	FilterGS: traversal-free parallel filtering and adaptive shrinking for large-scale LoD 3D Gaussian splatting	3D gaussian splatting gaussian splatting splatting	✅
29	COVTrack++: Learning Open-Vocabulary Multi-Object Tracking from Continuous Videos via a Synergistic Paradigm	COVTrack++: learns open-vocabulary multi-object tracking from continuous videos through a synergistic paradigm	open-vocabulary open vocabulary spatial relationship
30	SpectralSplats: Robust Differentiable Tracking via Spectral Moment Supervision	SpectralSplats: robust differentiable 3D Gaussian splatting tracking via spectral moment supervision	3D gaussian splatting 3DGS gaussian splatting
31	EndoVGGT: GNN-Enhanced Depth Estimation for Surgical 3D Reconstruction	EndoVGGT: GNN-enhanced depth estimation for surgical 3D reconstruction	depth estimation

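🔬 Pillar 1: Robot Control (5 papers)

#	Title	One-line summary	Tags	🔗	⭐
32	TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models	Proposes TAG, which uses target-agnostic guidance to stabilize target localization of VLA models in complex scenes	manipulation classifier-free guidance vision-language-action
33	Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection	Proposes the TSRL framework, which dynamically optimizes the training curriculum for deepfake detection to improve generalization	manipulation reinforcement learning PPO	✅
34	LGTM: Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation	Proposes LGTM, a training-free light-guided text-to-image diffusion model that works by manipulating the initial noise	manipulation
35	Latent Bias Alignment for High-Fidelity Diffusion Inversion in Real-World Image Reconstruction and Manipulation	Proposes a latent bias alignment method that improves the fidelity of diffusion models in real-image reconstruction and editing	manipulation
36	Towards Training-Free Scene Text Editing	Proposes TextFlow, a training-free scene text editing framework for high-fidelity text modification	manipulation	✅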

🔬 Pillar 6: Video Extraction (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
37	Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models	Proposes VisionToM to strengthen the theory-of-mind ability of multimodal large language models	egocentric large language model multimodal
38	HGGT: Robust and Flexible 3D Hand Mesh Reconstruction from Uncalibrated Images	Proposes HGGT for robust and flexible 3D hand mesh reconstruction from uncalibrated images	hand reconstruction foundation model	✅

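🔬 Pillar 8: Physics-based Animation (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
39	Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamic	Proposes an EEG-conditioned method for modeling spatiotemporal neural frames, used to reconstruct high-resolution dynamic brain fMRI	spatiotemporal multimodal
40	Uncertainty-Aware Vision-based Risk Object Identification via Conformal Risk Tube Prediction	Proposes an uncertainty-aware, vision-based risk object identification method built on conformal risk tube prediction	spatiotemporal	✅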

🔬 Pillar 7: Motion Retargeting (2 papers)

#	Title	One-line summary	Tags	🔗	⭐
41	B-MoE: A Body-Part-Aware Mixture-of-Experts "All Parts Matter" Approach to Micro-Action Recognition	Proposes B-MoE, a body-part-aware mixture-of-experts model for micro-action recognition	human motion
42	LaDy: Lagrangian-Dynamic Informed Network for Skeleton-based Action Segmentation via Spatial-Temporal Modulation	LaDy: a Lagrangian-dynamics-informed network for skeleton-based action segmentation that improves performance through spatio-temporal modulation	human motion	✅

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-line summary	Tags	🔗	⭐
43	ViHOI: Human-Object Interaction Synthesis with Visual Priors	ViHOI: synthesizes realistic human-object interactions using visual priors	motion generation physically plausible human-object interaction
