cs.CV (2026-03-26)

📊 79 papers in total | 🔗 19 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms & Architecture (28 🔗5)
Pillar 9: Embodied Foundation Models (19 🔗3)
Pillar 3: Spatial Perception & Semantics (17 🔗5)
Pillar 1: Robot Control (5 🔗2)
Pillar 8: Physics-based Animation (4 🔗1)
Pillar 4: Generative Motion (2 🔗1)
Pillar 5: Interaction & Reaction (2)
Pillar 6: Video Extraction & Matching (1 🔗1)
Pillar 7: Motion Retargeting (1 🔗1)

🔬 Pillar 2: RL Algorithms & Architecture (28 papers)

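# | Title | One-sentence summary | Tags
1 | VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents | VideoWeaver: a multimodal, multi-view video-to-video transfer framework for embodied agents | policy learning, egocentric, embodied AI
2 | Hierarchy-Guided Multimodal Representation Learning for Taxonomic Inference | Proposes hierarchy-guided multimodal representation learning for biodiversity identification | representation learning, foundation model, multimodal
3 | Bridging Perception and Reasoning: Token Reweighting for RLVR in Multimodal LLMs | Proposes a Token-Reweighting strategy that improves the perception and reasoning of multimodal LLMs on RLVR tasks | reinforcement learning, large language model, multimodal
4 | MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning | Proposes multi-stage reinforcement learning (MSRL) to scale the training of generative multimodal reward models | reinforcement learning, distillation, multimodal
5 | GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization | GDPO-Listener: expressive interactive head generation via auto-regressive flow matching and group reward-decoupled policy optimization | flow matching, motion generation, dyadic interaction
6 | Multimodal Dataset Distillation via Phased Teacher Models | Proposes the PTM-ST framework to address the insufficient capture of the teacher model's dynamically evolving knowledge in multimodal dataset distillation | distillation, multimodal
7 | Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models | Proposes a hybrid memory mechanism to handle subjects that disappear and reappear in dynamic video world models | world model, world models, spatiotemporal
8 | Label What Matters: Modality-Balanced and Difficulty-Aware Multimodal Active Learning | Proposes the RL-MBA framework to address modality balance and dynamically changing sample difficulty in multimodal active learning | reinforcement learning, multimodal
9 | Vega: Learning to Drive with Natural Language Instructions | Proposes the Vega model, which enables personalized autonomous driving from natural language instructions | world model, world models, vision-language-action
10 | LanteRn: Latent Visual Structured Reasoning | LanteRn: a latent-space visual structured reasoning framework that improves the visual understanding of multimodal models | reinforcement learning, multimodal, visual grounding
11 | CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation | Proposes CLIP-RD, which improves the efficiency of CLIP knowledge distillation through relational distillation | contrastive learning, teacher-student, distillation
12 | VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning | VideoTIR: uses reinforcement learning and tool-integrated reasoning to improve the accuracy and efficiency of long-video understanding | reinforcement learning, large language model, multimodal
13 | TIGFlow-GRPO: Trajectory Forecasting via Interaction-Aware Flow Matching and Reward-Driven Optimization | Proposes the TIGFlow-GRPO framework, which uses interaction-aware flow matching and reward-driven optimization to forecast trajectories that better respect social norms and physical constraints | flow matching, multimodal
14 | Towards Controllable Low-Light Image Enhancement: A Continuous Multi-illumination Dataset and Efficient State Space Framework | Proposes a controllable low-light image enhancement method to address the shortcomings of existing methods | SSM, state space model, multimodal
15 | FD$^2$: A Dedicated Framework for Fine-Grained Dataset Distillation | Proposes the FD$^2$ framework for fine-grained dataset distillation, improving few-shot learning performance | distillation
16 | AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization | AnyDoc: enhances document generation through large-scale HTML/CSS data synthesis and height-aware reinforcement optimization | reinforcement learning, large language model
17 | MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models | Proposes MoE-GRPO, which uses reinforcement learning to optimize expert routing in MoE-VLMs and improve multimodal understanding | reinforcement learning
18 | Towards Video Anomaly Detection from Event Streams: A Baseline and Benchmark Datasets | Proposes the EWAD framework to address data sparsity and training difficulties in event-stream video anomaly detection | distillation, spatiotemporal
19 | Image Rotation Angle Estimation: Comparing Circular-Aware Methods | Compares five circular-aware methods for image rotation angle estimation and validates the effectiveness of probabilistic methods | Mamba, MAE
20 | Learning to Rank Caption Chains for Video-Text Alignment | Proposes a ranking-optimization-based video-text alignment method that improves long-text generation quality | DPO, direct preference optimization
21 | Reinforcing Structured Chain-of-Thought for Video Understanding | Proposes a Summary-Driven RL framework that strengthens the reasoning and generalization of MLLMs in video understanding | reinforcement learning, large language model, chain-of-thought
22 | Collision-Aware Vision-Language Learning for End-to-End Driving with Multimodal Infraction Datasets | Proposes VLAAD and the CARLA-Collide dataset to improve collision avoidance in end-to-end autonomous driving | representation learning, multimodal
23 | LEMON: a foundation model for nuclear morphology in Computational Pathology | LEMON: a foundation model for nuclear morphology in computational pathology | representation learning, foundation model
24 | GazeQwen: Lightweight Gaze-Conditioned LLM Modulation for Streaming Video Understanding | GazeQwen: a lightweight gaze-aware LLM modulation method for streaming video understanding | JEPA, large language model, multimodal
25 | CLIP-RD: Relational Distillation for Efficient CLIP Knowledge Distillation | Proposes CLIP-RD, which improves the efficiency of CLIP knowledge distillation through relational distillation | contrastive learning, teacher-student, distillation
26 | Geo$^\textbf{2}$: Geometry-Guided Cross-view Geo-Localization and Image Synthesis | Geo$^2$: a unified geometry-guided framework for cross-view geo-localization and image synthesis that reports SOTA performance | flow matching, VGGT, foundation model
27 | World Reasoning Arena | Proposes WR-Arena to evaluate world models on action simulation, long-horizon prediction, and reasoning and planning | world model, world models, physically plausible
28 | DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation | DiReCT: disentangled regularization of contrastive trajectories that improves the quality of physics-constrained video generation | flow matching, contrastive learning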

🔬 Pillar 9: Embodied Foundation Models (19 papers)

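# | Title | One-sentence summary | Tags
29 | Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models | Photon: speeds up 3D medical image understanding with efficient multimodal large language models | large language model, multimodal
30 | Visual Attention Drifts, but Anchors Hold: Mitigating Hallucination in Multimodal Large Language Models via Cross-Layer Visual Anchors | Proposes CLVA, which mitigates hallucination in multimodal large language models via cross-layer visual anchors | large language model, multimodal
31 | Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification | Evaluates the gender and ethnicity bias of multimodal large language models in face verification | large language model, multimodal
32 | MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models | MuRF: unlocks the multi-scale potential of vision foundation models to improve inference performance | foundation model
33 | Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs | Proposes the VISAGE framework, which calibrates visual attention to make MDLLMs more resistant to multimodal hallucination | large language model, multimodal, visual grounding
34 | GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing | Proposes GeoHeight-Bench to address the lack of height awareness in large models for remote sensing | multimodal
35 | SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding | Proposes SlotVTG to address object-centric learning in video temporal grounding | large language model, multimodal
36 | Pixelis: Reasoning in Pixels, from Seeing to Acting | Pixelis: a pixel-level reasoning agent that improves the generalization and physical grounding of vision-language systems by executing operations and learning from their outcomes | multimodal, chain-of-thought
37 | Self-Corrected Image Generation with Explainable Latent Rewards | Proposes the xLARD framework, which uses explainable latent rewards for self-corrected image generation | large language model, multimodal
38 | BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning | Proposes BFMD, a full-match dense badminton dataset for dense captioning of badminton shot events | multimodal
39 | Knowledge-Guided Failure Prediction: Detecting When Object Detectors Miss Safety-Critical Objects | Proposes a knowledge-guided failure prediction method that detects when object detectors miss objects in safety-critical scenarios | foundation model
40 | PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders | Proposes PMT, a Plain Mask Transformer for image and video segmentation built on frozen vision encoders | foundation model
41 | GIFT: Global Irreplaceability Frame Targeting for Efficient Video Understanding | GIFT: a global irreplaceability frame selection method for efficient video understanding | large language model
42 | Synergistic Event-SVE Imaging for Quantitative Propellant Combustion Diagnostics | Proposes a synergistic Event-SVE imaging system for quantitative propellant combustion diagnostics, addressing high dynamic range and smoke occlusion | multimodal
43 | BEVMAPMATCH: Multimodal BEV Neural Map Matching for Robust Re-Localization of Autonomous Vehicles | BEVMapMatch: a multimodal BEV neural map matching method for robust re-localization of autonomous vehicles in adverse environments | multimodal
44 | Good Scores, Bad Data: A Metric for Multimodal Coherence | Proposes a multimodal coherence score to address data inconsistency | multimodal
45 | THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond | Proposes THFM, a unified video foundation model for 4D human perception and other tasks | foundation model
46 | Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations | Proposes a VLM-based speech-synchronized whiteboard generation method for automatically generating educational video content | multimodal, TAMP
47 | GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks | Proposes the GUIDE benchmark for understanding and assisting users in open-ended GUI tasks | multimodal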

🔬 Pillar 3: Spatial Perception & Semantics (17 papers)

# | Title | One-sentence summary | Tags
48 | Learning Explicit Continuous Motion Representation for Dynamic Gaussian Splatting from Monocular Videos | Proposes a dynamic Gaussian splatting method based on SE(3) B-spline motion bases for high-quality dynamic scene reconstruction from monocular videos | gaussian splatting, splatting, motion representation
49 | AirSplat: Alignment and Rating for Robust Feed-Forward 3D Gaussian Splatting | AirSplat: alignment and rating for robust feed-forward 3D Gaussian splatting | 3D gaussian splatting, gaussian splatting, splatting
50 | Relaxed Rigidity with Ray-based Grouping for Dynamic Gaussian Splatting | Proposes a relaxed rigidity method with ray-based grouping for dynamic Gaussian splatting, improving monocular video reconstruction quality | 3D gaussian splatting, gaussian splatting, splatting
51 | ViewSplat: View-Adaptive Dynamic Gaussian Splatting for Feed-Forward Synthesis | ViewSplat: view-adaptive dynamic Gaussian splatting for fast, high-fidelity novel view synthesis | 3D gaussian splatting, gaussian splatting, splatting
52 | Towards Comprehensive Real-Time Scene Understanding in Ophthalmic Surgery through Multimodal Image Fusion | Proposes a multimodal image fusion network for real-time scene understanding and precise instrument tracking in ophthalmic surgery | scene understanding, multimodal
53 | Towards Foundation Models for 3D Scene Understanding: Instance-Aware Self-Supervised Learning for Point Clouds | PointINS: instance-aware self-supervised learning for point clouds that improves 3D scene understanding | scene understanding, foundation model
54 | MegaFlow: Zero-Shot Large Displacement Optical Flow | MegaFlow: a zero-shot large-displacement optical flow estimation method that requires no domain-specific fine-tuning | optical flow, motion estimation
55 | Training-free Detection and 6D Pose Estimation of Unseen Surgical Instruments | Proposes a training-free 6D pose estimation method for surgical instruments that applies to unseen instruments | scene understanding, 6D pose estimation, geometric consistency
56 | Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting | LGTM: 4K-resolution feed-forward novel view synthesis via textured Gaussians | 3D gaussian splatting, gaussian splatting, splatting
57 | Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos | Proposes Colon-Bench for scalable dense lesion annotation in colonoscopy videos, to advance AI for early colon cancer screening | open-vocabulary, open vocabulary, large language model
58 | EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions | EgoXtreme: a dataset for robust object pose estimation in egocentric views under extreme conditions | 6D pose estimation, egocentric, egocentric vision
59 | GaussFusion: Improving 3D Reconstruction in the Wild with A Geometry-Informed Video Generator | GaussFusion: uses a geometry-informed video generator to improve 3D reconstruction quality in the wild | 3D gaussian splatting, 3DGS, gaussian splatting
60 | MoRGS: Efficient Per-Gaussian Motion Reasoning for Streamable Dynamic 3D Scenes | MoRGS: efficient per-Gaussian motion reasoning for streamable dynamic 3D scene reconstruction | 3D gaussian splatting, gaussian splatting, splatting
61 | Interpretable Zero-shot Referring Expression Comprehension with Query-driven Scene Graphs | Proposes SGREC for zero-shot referring expression comprehension | scene understanding, spatial relationship, large language model
62 | Infinite Gaze Generation for Videos with Autoregressive Diffusion | Proposes an infinite gaze generation framework based on autoregressive diffusion models that predicts human gaze trajectories in videos of arbitrary length | scene understanding, multimodal, TAMP
63 | HeSS: Head Sensitivity Score for Sparsity Redistribution in VGGT | Proposes HeSS, which uses head sensitivity to guide VGGT sparsification and improve accuracy at high sparsity | VGGT
64 | Few TensoRF: Enhance the Few-shot on Tensorial Radiance Fields | Few TensoRF: combines tensor decomposition with frequency regularization to improve few-shot 3D reconstruction | NeRF

🔬 Pillar 1: Robot Control (5 papers)

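# | Title | One-sentence summary | Tags
65 | LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior | LaMP: learns vision-language-action policies using 3D scene flow as a latent motion prior | manipulation, flow matching, scene flow
66 | Probabilistic Concept Graph Reasoning for Multimodal Misinformation Detection | Proposes PCGR, a probabilistic concept graph reasoning framework for interpretable multimodal misinformation detection | manipulation, large language model, multimodal
67 | Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models | Proposes the TIES framework, which uses inter-layer rank consistency to improve VLA model efficiency and outperform attention-magnitude-based selection | manipulation, vision-language-action, VLA
68 | PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos | PAWS: large-scale perception of object articulation in the wild from egocentric videos | manipulation, scene understanding, egocentric
69 | THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics | Proposes the THEMIS benchmark for holistic evaluation of multimodal large language models in scientific paper fraud forensics | manipulation, large language model, multimodal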

🔬 Pillar 8: Physics-based Animation (4 papers)

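# | Title | One-sentence summary | Tags
70 | UNIC: Neural Garment Deformation Field for Real-time Clothed Character Animation | Proposes UNIC, a neural-deformation-field-based method for real-time garment animation | character animation
71 | PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference | PackForcing: uses short-video training to achieve long-video sampling and long-context inference | spatiotemporal
72 | GeoNDC: A Queryable Neural Data Cube for Planetary-Scale Earth Observation | GeoNDC: a queryable neural data cube for planetary-scale Earth observation | spatiotemporal
73 | Dynamic LIBRAS Gesture Recognition via CNN over Spatiotemporal Matrix Representation | Proposes a dynamic LIBRAS gesture recognition method based on spatiotemporal matrices and a CNN, for controlling home automation devices | spatiotemporal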

🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-sentence summary | Tags
74 | Bilingual Text-to-Motion Generation: A New Benchmark and Baselines | Proposes the BiHumanML3D benchmark for bilingual text-to-motion generation | motion diffusion, text-to-motion, motion synthesis
75 | Unleashing Guidance Without Classifiers for Human-Object Interaction Animation | Proposes LIGHT to address contact quality in human-object interaction animation generation | classifier-free guidance, contact-aware, human-object interaction

🔬 Pillar 5: Interaction & Reaction (2 papers)

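# | Title | One-sentence summary | Tags
76 | Challenges in Hyperspectral Imaging for Autonomous Driving: The HSI-Drive Case | Analyzes vision techniques on the HSI-Drive dataset to address the challenges of hyperspectral imaging for autonomous driving | HSI
77 | ArtHOI: Taming Foundation Models for Monocular 4D Reconstruction of Hand-Articulated-Object Interactions | ArtHOI: uses foundation models for monocular 4D reconstruction of hand-articulated-object interactions | HOI, large language model, foundation model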

🔬 Pillar 6: Video Extraction & Matching (1 paper)

# | Title | One-sentence summary | Tags
78 | AG-EgoPose: Leveraging Action-Guided Motion and Kinematic Joint Encoding for Egocentric 3D Pose Estimation | AG-EgoPose: uses action-guided motion and joint encoding for egocentric 3D pose estimation | egocentric, first-person view

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-sentence summary | Tags
79 | ICTPolarReal: A Polarized Reflection and Material Dataset of Real World Objects | Proposes the ICTPolarReal dataset to improve reflectance and material modeling of real-world objects | geometric consistency
