cs.CV (2026-05-08)

📊 69 papers in total | 🔗 17 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (24 🔗4) · Pillar 2: RL & Architecture (23 🔗7) · Pillar 3: Perception & Semantics (16 🔗3) · Pillar 6: Video Extraction (4 🔗3) · Pillar 4: Generative Motion (2)

🔬 Pillar 9: Embodied Foundation Models (24 papers)

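| # | Title | One-line summary | Tags |
|---|---|---|---|
| 1 | AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models | Proposes the AudioFace framework, which uses priors from multimodal large models for language-assisted, speech-driven facial animation generation. | large language model, multimodal |
| 2 | ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations | Proposes the ForgeVLA framework, which trains vision-language-action models through federated learning without language annotations. | vision-language-action, VLA |
| 3 | Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding | Proposes the Qwen3-VL-Seg framework, which achieves efficient open-world referring segmentation with a lightweight box-guided mask decoder. | large language model, foundation model, multimodal |
| 4 | STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation | Proposes the STARFlow2 architecture, which unifies text and image generation through autoregressive normalizing flows. | multimodal |
| 5 | GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning | Proposes the GazeVLM architecture, which performs active visual reasoning through internal attention control to address the limits of passive processing in VLMs. | multimodal |
| 6 | Benchmarking Foundation Models for Renal Lesion Stratification in CT | Benchmarks medical foundation models on renal lesion stratification in CT; radiomics remains the best-performing approach. | foundation model |
| 7 | LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation | Proposes LithoBench for evaluating how well large models understand geological semantics in remote-sensing lithology interpretation. | multimodal |
| 8 | Multimodal Stepwise Clinically-Guided Attention Learning for Pathological Complete Response Prediction in Breast Cancer | Proposes a multimodal, stepwise, clinically guided attention learning framework to improve the accuracy and generalization of pathological complete response (pCR) prediction in breast cancer. | multimodal |
| 9 | InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search | Proposes the InterLV-Search benchmark for evaluating interleaved multimodal agentic search. | multimodal |
| 10 | UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition | Proposes the UniD-Shift framework, which achieves unified 2D-3D semantic segmentation through an interpretable shared-private multimodal decomposition. | multimodal |
| 11 | RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation | Proposes the RCoT-Seg framework, which decouples and optimizes video reasoning and object segmentation through reinforced chain-of-thought. | chain-of-thought |
| 12 | SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere | SphereVAD: training-free video anomaly detection via geodesic inference on the unit hypersphere. | large language model, multimodal |
| 13 | Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models | Proposes VURB, a benchmark for video-understanding reward models, and VUP-35K, a large-scale preference dataset, significantly improving alignment for video generation and understanding tasks. | multimodal, chain-of-thought |
| 14 | Anisotropic Modality Align | Proposes the AnisoAlign framework, which addresses the modality gap in multimodal representations through anisotropic geometric correction. | large language model, multimodal |
| 15 | ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring | Proposes the ChartREG++ benchmark and a code-driven synthesis pipeline to address multi-target and fine-grained challenges in chart referring expression grounding. | multimodal, visual grounding |
| 16 | GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization | Proposes the GPO-V framework, which jailbreaks diffusion vision-language models through global probability optimization. | large language model, multimodal |
| 17 | Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection | Proposes the S2M framework, which provides multimodal supervision by converting remote-sensing change-detection masks into structured text. | large language model, multimodal |
| 18 | Towards Billion-scale Multi-modal Biometric Search | Proposes the Bharat ABIS system for billion-scale multimodal biometric identification and efficient deduplication. | multimodal |
| 19 | Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models | Uses vision-language models for zero-shot operational design domain (ODD) perception, improving the safety and compliance of autonomous driving systems. | chain-of-thought |
| 20 | TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos | Proposes the TraceAV-Bench benchmark for evaluating multi-hop audio-visual trajectory reasoning and hallucination robustness in long videos. | multimodal |
| 21 | PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models | Proposes the PolarVLM framework, which fuses physical polarization information to help vision-language models understand semantics in reflective and transparent scenes. | multimodal |
| 22 | EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement | Proposes EditRefiner, an agentic framework based on human feedback for precise image-editing refinement. | instruction following |
| 23 | Real-IAD MVN: A Multi-View Normal Vector Dataset and Benchmark for High-Fidelity Industrial Anomaly Detection | Proposes the Real-IAD MVN dataset and benchmark, which uses multi-view normal maps to address the detection of tiny geometric defects in industrial settings. | multimodal |
| 24 | Fine-tuning a vision-language model for fracture-surface morphology recognition | Proposes a fine-tuning framework based on Qwen3-VL that significantly improves domain-specific accuracy in fracture-surface morphology recognition. | multimodal |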

🔬 Pillar 2: RL & Architecture (23 papers)

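| # | Title | One-line summary | Tags |
|---|---|---|---|
| 25 | One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy | Proposes the OneWM-VLA model, which improves the long-horizon planning of vision-language-action (VLA) policies through one-token-per-frame compression and a flow-matching objective. | flow matching, world model, world models |
| 26 | Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training | Proposes the Sword world-model framework, which improves the robustness of VLA policy training through dynamic latent bootstrapping and style augmentation. | reinforcement learning, world model, world models |
| 27 | ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation | Proposes the ST-Gen4D framework, which achieves highly consistent 4D generation by introducing a world model with 4D spatiotemporal cognition. | world model, world models, spatiotemporal |
| 28 | Learning Visual Feature-Based World Models via Residual Latent Action | Proposes a world model based on residual latent actions (RLA), which uses flow matching for efficient visual feature prediction and robot policy learning. | policy learning, flow matching, world model |
| 29 | Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness | Proposes Pan-FM, a pan-organ foundation model with saliency-guided masking, to improve robustness to missing data in multimodal medical imaging. | representation learning, distillation, foundation model |
| 30 | ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning | Proposes the ReasonEdit framework, which builds a large-scale chain-of-thought dataset and uses reinforcement learning for interpretable image-editing evaluation. | reinforcement learning, large language model, multimodal |
| 31 | GEM: Generating LiDAR World Model via Deformable Mamba | Proposes GEM, a generative LiDAR world model based on deformable Mamba, for high-fidelity simulation of environment dynamics. | world model, world models, Mamba |
| 32 | Flow-OPD: On-Policy Distillation for Flow Matching Models | Proposes the Flow-OPD framework, which uses on-policy distillation to address reward sparsity and gradient interference in multi-task alignment of flow-matching models. | flow matching, distillation, large language model |
| 33 | EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction | Proposes EmambaIR, an efficient visual state space model for event-guided image reconstruction. | Mamba, SSM, state space model |
| 34 | Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning | Proposes the Sat3R framework, which achieves efficient satellite DSM reconstruction through RPC-aware depth fine-tuning. | MAE, monocular depth, metric depth |
| 35 | Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers | Proposes the Diffusion-APO algorithm, which optimizes video diffusion models through trajectory-aware direct preference alignment. | RLHF, DPO, direct preference optimization |
| 36 | ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs | Proposes the ShellfishNet benchmark dataset, aimed at robust recognition of shellfish species in complex underwater environments. | SSM, state space model, large language model |
| 37 | Breaking Spatial Uniformity: Prior-Guided Mamba with Radial Serialization for Lens Flare Removal | Proposes DeflareMambav2, a prior-guided Mamba architecture with radial serialization for efficient lens flare removal. | Mamba, SSM, state space model |
| 38 | VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network | Proposes the VIMCAN hybrid architecture, which combines Mamba with cross-attention for efficient visual-inertial 3D human pose estimation. | Mamba, multimodal |
| 39 | BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning | Proposes the BalCapRL framework, which improves the image-captioning quality of multimodal large language models through multi-objective reinforcement learning. | reinforcement learning, large language model, multimodal |
| 40 | Neurosymbolic Framework for Concept-Driven Logical Reasoning in Skeleton-Based Human Action Recognition | Proposes a skeleton-based action recognition method built on a neurosymbolic framework, enabling concept-driven logical reasoning and interpretability. | representation learning, motion representation |
| 41 | Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations | Proposes a one-step distillation method based on pretrained diffusion models, improving generation efficiency and image quality. | flow matching, distillation |
| 42 | SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models | Proposes the SARA framework, which improves text adherence in video diffusion models through semantically adaptive relational alignment. | distillation, foundation model |
| 43 | RELO: Reinforcement Learning to Localize for Visual Object Tracking | Proposes RELO, a reinforcement-learning localization framework that replaces handcrafted priors with reward-driven learning to improve visual object tracking. | reinforcement learning |
| 44 | Towards multi-modal forgery representation learning for AI-generated video detection and localization | Proposes a multimodal forgery representation learning framework for detecting and localizing AI-generated video. | representation learning |
| 45 | Closed-Form Linear-Probe Dataset Distillation for Pre-trained Vision Models | Proposes the CLP-DD method, which uses a closed-form solution for efficient dataset distillation with pretrained vision models. | distillation |
| 46 | PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition | Proposes the PRIMED framework, which achieves adaptive modality suppression in referring audio-visual segmentation through a biased-competition mechanism. | contrastive learning, multimodal |
| 47 | Implicit Preference Alignment for Human Image Animation | Proposes the Implicit Preference Alignment (IPA) framework to address the quality of hand-motion generation in human image animation. | reinforcement learning, direct preference optimization |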

🔬 Pillar 3: Perception & Semantics (16 papers)

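| # | Title | One-line summary | Tags |
|---|---|---|---|
| 48 | AsyncEvGS: Asynchronous Event-Assisted Gaussian Splatting for Handheld Motion-Blurred Scenes | Proposes AsyncEvGS for 3D Gaussian Splatting reconstruction of motion-blurred scenes captured with handheld devices. | 3D gaussian splatting, 3DGS, 3D reconstruction |
| 49 | High-Fidelity Surface Splatting-Based 3D Reconstruction from Multi-View Images | Proposes an implicit moving least squares method based on compact polynomial kernels for high-fidelity multi-view 3D reconstruction. | 3D gaussian splatting, 3DGS, 3D reconstruction |
| 50 | SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild | Proposes the SAM 3D Animal framework for promptable 3D reconstruction of multiple animals in the wild. | 3D reconstruction, sam 3D, SAM 3D |
| 51 | From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting | Proposes the GD-DIFF method, which detects scene changes by directly analyzing the attributes of 3D Gaussian primitives. | 3D gaussian splatting, gaussian splatting, splatting |
| 52 | SatSurfGS: Generalizable 2D Gaussian Splatting for Sparse-View Satellite Surface Reconstruction | Proposes SatSurfGS, a generalizable framework based on 2D Gaussian Splatting for sparse-view satellite surface reconstruction. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 53 | Disambiguating 2D-3D Correspondences in Gaussian Splatting-based Feature Fields for Visual Localization | Proposes the SplitGS-Loc framework, which resolves ambiguous 2D-3D correspondences in visual localization with Gaussian Splatting-based feature fields (GSFF) through Gaussian splitting and multi-view consistency optimization. | gaussian splatting, splatting |
| 54 | Rethinking Dense Optical Flow without Test-Time Scaling | Proposes a dense optical flow estimation framework that needs no test-time iterative optimization, using foundation-model priors in place of compute-intensive refinement. | monocular depth, optical flow, foundation model |
| 55 | Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis | Proposes a differentiable ray-tracing framework based on Gaussian splatting that provides a unified representation for radio propagation simulation and view synthesis. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 56 | Learning Image-Adaptive Scale Fields for Metric Depth Recovery | Proposes a metric depth recovery method based on image-adaptive scale fields to address scale ambiguity in monocular depth estimation. | depth estimation, monocular depth, metric depth |
| 57 | APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment | Proposes the APEX evaluation framework, which uses the sliced Wasserstein distance for assumption-free image quality assessment. | open-vocabulary, open vocabulary, foundation model |
| 58 | 6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks | Proposes a modular framework based on keypoint heatmap regression that improves 6D pose estimation accuracy through RGB-D cross-fusion. | 6D pose estimation |
| 59 | Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images | Proposes the Cross3R model, which introduces drone views to enable 6-DoF 3D reconstruction and localization across satellite, drone, and ground images. | 3D reconstruction |
| 60 | Rebalancing gradient to improve self-supervised co-training of depth, odometry and optical flow predictions | CoopNet: rebalances gradients to improve joint self-supervised learning of depth, odometry, and optical flow. | optical flow |
| 61 | Aquatic Neuromorphic Optical Flow | Proposes a self-supervised underwater optical flow estimation framework based on spiking neural networks for efficient perception in resource-constrained settings. | optical flow |
| 62 | SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis | Proposes the SplatWeaver framework, which achieves efficient, generalizable novel view synthesis by dynamically allocating Gaussian primitives. | 3D gaussian splatting, gaussian splatting, splatting |
| 63 | Divide and Conquer: Object Co-occurrence Helps Mitigate Simplicity Bias in OOD Detection | Proposes the OCO framework, which uses object co-occurrence to mitigate simplicity bias in OOD detection. | scene understanding |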

🔬 Pillar 6: Video Extraction (4 papers)

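| # | Title | One-line summary | Tags |
|---|---|---|---|
| 64 | EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting | EggHand: egocentric hand pose forecasting with a multimodal foundation model. | egocentric, vision-language-action, VLA |
| 65 | EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams | Proposes the EgoPro-Bench benchmark, aimed at improving personalized proactive interaction by multimodal large models from a first-person perspective. | egocentric, large language model, multimodal |
| 66 | EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding | Proposes the EyeCue framework, which detects driver cognitive distraction through gaze-guided egocentric video understanding. | egocentric, multimodal |
| 67 | Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models | Uncovers and reshapes the latent representation of 3D scene topology in vision-language models, significantly improving spatial reasoning. | egocentric |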

🔬 Pillar 4: Generative Motion (2 papers)

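| # | Title | One-line summary | Tags |
|---|---|---|---|
| 68 | Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization | Proposes a retrieval-guided diffusion noise optimization framework for zero-shot human motion generation under heavy constraints. | motion generation, human motion, human motion generation |
| 69 | Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference | Proposes TOAU, a task-oriented edge-cloud collaborative communication framework for efficient, low-latency human action understanding. | VQ-VAE |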
