cs.CV (2026-03-23)

📊 48 papers in total | 🔗 10 with code

🎯 Navigation by Interest Area

Pillar 2: RL & Architecture (17 🔗4) · Pillar 9: Embodied Foundation Models (10 🔗2) · Pillar 3: Perception & Semantics (8 🔗3) · Pillar 1: Robot Control (5 🔗1) · Pillar 7: Motion Retargeting (4) · Pillar 4: Generative Motion (1) · Pillar 6: Video Extraction (1) · Pillar 8: Physics-based Animation (1) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 2: RL & Architecture (17 papers)

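# | Title | One-sentence summary | Tags | 🔗
1 | Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding | Mamba-VMR: augments multimodal queries with generated videos for precise temporal grounding | Mamba multimodal
2 | SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation | Proposes SpatialReward to improve fine-grained spatial consistency in text-to-image generation | reinforcement learning spatial relationship visual grounding
3 | Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model | daVinci-MagiHuman: a fast audio-video generative foundation model built on a single-stream Transformer | distillation foundation model
4 | PPGL-Swarm: Integrated Multimodal Risk Stratification and Hereditary Syndrome Detection in Pheochromocytoma and Paraganglioma | PPGL-Swarm: multimodal risk stratification and hereditary syndrome detection for pheochromocytoma and paraganglioma | reinforcement learning multimodal
5 | ALADIN: Attribute-Language Distillation Network for Person Re-Identification | Proposes ALADIN, an attribute-language distillation network that improves fine-grained feature learning for person re-identification | representation learning distillation multimodal
6 | Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement | Proposes VFLM, which uses visual feedback to iteratively refine text layout generation, improving readability and aesthetics | reinforcement learning large language model multimodal
7 | Multi-View Deformable Convolution Meets Visual Mamba for Coronary Artery Segmentation | Proposes MDSVM-UNet, combining multi-view deformable convolution with visual Mamba for coronary artery segmentation | Mamba SSM state space model
8 | Clinical Graph-Mediated Distillation for Unpaired MRI-to-CFI Hypertension Prediction | Proposes clinical graph-mediated distillation for hypertension prediction from unpaired MRI and fundus images | distillation multimodal
9 | Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends | Proposes a visual odometry frontend with image-conditioned adaptive parameter tuning, improving performance on resource-constrained robots | reinforcement learning visual odometry
10 | A Latent Representation Learning Framework for Hyperspectral Image Emulation in Remote Sensing | Proposes a hyperspectral image emulation framework based on latent representation learning to accelerate remote sensing application development | representation learning HSI
11 | Adaptive Video Distillation: Mitigating Oversaturation and Temporal Collapse in Few-Step Generation | Proposes an adaptive video distillation framework that addresses oversaturation and temporal collapse in few-step generation | distillation physically plausible
12 | ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints | Proposes ACPO, which uses asymmetric constraint optimization to counter likelihood displacement in vision-language alignment | DPO direct preference optimization multimodal
13 | WorldCache: Content-Aware Caching for Accelerated Video World Models | Proposes WorldCache, which accelerates inference of video world models through perception-constrained dynamic caching | world model
14 | Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models | Omni-WorldBench: a comprehensive interaction-centric evaluation benchmark for world models | world model
15 | Manifold-Aware Exploration for Reinforcement Learning in Video Generation | Proposes SAGE-GRPO, which improves the stability and quality of reinforcement learning for video generation through manifold-aware exploration | reinforcement learning
16 | Rethinking SAR ATR: A Target-Aware Frequency-Spatial Enhancement Framework with Noise-Resilient Knowledge Guidance | Proposes a target-aware frequency-spatial enhancement framework that improves target recognition accuracy for SAR images under noise | representation learning teacher-student distillation
17 | From Part to Whole: 3D Generative World Model with an Adaptive Structural Hierarchy | Proposes a 3D generative world model with an adaptive structural hierarchy, addressing structural complexity and generalization in single-image 3D generation | world model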

🔬 Pillar 9: Embodied Foundation Models (10 papers)

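# | Title | One-sentence summary | Tags | 🔗
18 | Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation | Proposes TrajSeg, which strengthens MLLM trajectory awareness in video reasoning segmentation through bidirectional text-trajectory alignment | large language model multimodal
19 | Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts | Proposes the DaP-ICoT framework, which improves the efficiency of multimodal chain-of-thought reasoning with dynamic and precise visual information | multimodal chain-of-thought
20 | Repurposing Geometric Foundation Models for Multi-view Diffusion | GLD: performs multi-view diffusion in the feature space of geometric foundation models for high-quality novel view synthesis | foundation model
21 | Adapting Point Cloud Analysis via Multimodal Bayesian Distribution Learning | Proposes BayesMM, which achieves test-time adaptation for point cloud analysis via multimodal Bayesian distribution learning | multimodal
22 | Exploring Multimodal Prompts For Unsupervised Continuous Anomaly Detection | Proposes a UCAD framework based on multimodal prompts, improving anomaly detection accuracy in complex scenes | multimodal
23 | VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding | VideoDetective: hunts for clues via extrinsic queries and intrinsic relevance to tackle long video understanding | large language model multimodal
24 | HumanOmni-Speaker: Identifying Who said What and When | Proposes the HumanOmni-Speaker model to determine "who said what and when" in multi-speaker conversations | large language model multimodal
25 | StreamingClaw Technical Report | Proposes StreamingClaw, a unified framework for real-time streaming video understanding and embodied intelligence | multimodal
26 | SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection | SteelDefectX: a coarse-to-fine vision-language dataset and benchmark for generalizable steel surface defect detection | zero-shot transfer
27 | Which Concepts to Forget and How to Refuse? Decomposing Concepts for Continual Unlearning in Large Vision-Language Models | Proposes a concept-decomposition framework for continual unlearning, addressing inappropriate refusals in vision-language models | multimodal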

🔬 Pillar 3: Perception & Semantics (8 papers)

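# | Title | One-sentence summary | Tags | 🔗
28 | Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection | Group3D: MLLM-driven semantic grouping for open-vocabulary 3D object detection | open-vocabulary open vocabulary geometric consistency
29 | Cross-Instance Gaussian Splatting Registration via Geometry-Aware Feature-Guided Alignment | Proposes GSA, which registers 3D Gaussian splatting models across instances via geometry-aware feature-guided alignment | 3D gaussian splatting 3DGS gaussian splatting
30 | FreeArtGS: Articulated Gaussian Splatting Under Free-moving Scenario | FreeArtGS: an articulated Gaussian splatting reconstruction method for free-moving scenarios | 3DGS gaussian splatting splatting
31 | GenOpticalFlow: A Generative Approach to Unsupervised Optical Flow Learning | GenOpticalFlow: a generative framework for unsupervised optical flow learning that requires no human annotation | depth estimation optical flow motion estimation
32 | PEARL: Geometry Aligns Semantics for Training-Free Open-Vocabulary Semantic Segmentation | Proposes PEARL, which achieves training-free open-vocabulary semantic segmentation by using geometry to align semantics | open-vocabulary open vocabulary
33 | RefracGS: Novel View Synthesis Through Refractive Water Surfaces with 3D Gaussian Ray Tracing | RefracGS: reconstructs novel views beneath refractive water surfaces via Gaussian ray tracing | 3D gaussian splatting 3DGS gaussian splatting
34 | SatGeo-NeRF: Geometrically Regularized NeRF for Satellite Imagery | SatGeo-NeRF: a geometrically regularized NeRF for satellite imagery that mitigates geometric artifacts caused by overfitting | NeRF
35 | GTSR: Subsurface Scattering Awared 3D Gaussians for Translucent Surface Reconstruction | Proposes GTSR, a 3D Gaussian method based on subsurface scattering for reconstructing the surfaces of translucent objects | 3DGS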

🔬 Pillar 1: Robot Control (5 papers)

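# | Title | One-sentence summary | Tags | 🔗
36 | DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models | Proposes DualCoT-VLA, which addresses the reasoning and latency issues of VLA models on complex tasks through a parallel visual-linguistic chain of thought | manipulation vision-language-action VLA
37 | PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation | Proposes the PAM engine, which unifies pose, appearance, and motion for controllable sim-to-real HOI video generation | sim-to-real HOI MANO
38 | VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection | VIGIL: part-grounded structured reasoning that improves the generalization of deepfake detection | manipulation reinforcement learning large language model
39 | ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model | Proposes ThinkJEPA, which uses a vision-language model to enhance latent world models and improve long-horizon prediction | manipulation world model
40 | AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image Editing | AdaEdit: proposes adaptive temporal and channel modulation to improve the image editing quality of flow matching models | manipulation flow matching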

🔬 Pillar 7: Motion Retargeting (4 papers)

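# | Title | One-sentence summary | Tags | 🔗
41 | Biophysics-Enhanced Neural Representations for Patient-Specific Respiratory Motion Modeling | Proposes PRISM-RM, which models patient-specific respiratory motion using implicit neural representations with biophysical constraints | motion estimation motion representation
42 | 3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing | Proposes the 3D-Layout-R1 framework, which performs language-instructed spatial editing through structured reasoning | spatial relationship large language model chain-of-thought
43 | SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning | SpatialBoost: enhances the spatial awareness of visual representations through language-guided reasoning | spatial relationship large language model chain-of-thought
44 | SARe: Structure-Aware Large-Scale 3D Fragment Reassembly | Proposes the structure-aware reassembly (SARe) framework, which tackles adjacency reasoning in large-scale 3D fragment reassembly | geometric consistency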

🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-sentence summary | Tags | 🔗
45 | UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation | UniMotion: a unified framework for understanding and generating motion, text, and vision | motion latent human motion motion representation

🔬 Pillar 6: Video Extraction (1 paper)

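# | Title | One-sentence summary | Tags | 🔗
46 | Timing In stand-up Comedy: Text, Audio, Laughter, Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing | TIC-TALK: builds a pipeline and a database of text, audio, and pose for the multimodal study of comedic timing | HuMoR multimodal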

🔬 Pillar 8: Physics-based Animation (1 paper)

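# | Title | One-sentence summary | Tags | 🔗
47 | Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention | Proposes a unified spatiotemporal token compression method that improves Video-LLM performance at ultra-low retention ratios | spatiotemporal large language model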

🔬 Pillar 5: Interaction & Reaction (1 paper)

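# | Title | One-sentence summary | Tags | 🔗
48 | Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTEvent | Benchmarks recurrent event-based object detection for industrial multi-class recognition on the MTEvent dataset | human-object interaction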
