cs.CV (2026-02-26)

📊 52 papers in total | 🔗 12 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (15 🔗5) · Pillar 3: Perception & Semantics (14 🔗2) · Pillar 9: Embodied Foundation Models (14 🔗5) · Pillar 1: Robot Control (4) · Pillar 4: Generative Motion (3) · Pillar 8: Physics-based Animation (2)

🔬 Pillar 2: RL & Architecture (15 papers)

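| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation | SPATIALALIGN: a self-improvement framework that strengthens the ability of text-to-video generation models to model dynamic spatial relationships | DPO, direct preference optimization, spatial relationship | |
| 2 | MediX-R1: Open Ended Medical Reinforcement Learning | MediX-R1: an open-ended reinforcement learning framework for medical multimodal large language models | reinforcement learning, large language model, multimodal | |
| 3 | From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models | Proposes DPE, a diagnostic-driven iterative training method that improves the continual learning ability of large multimodal models on open-ended tasks | reinforcement learning, multimodal | |
| 4 | A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling | Proposes CheXficient, which uses active data selection to build a chest X-ray foundation model efficiently | representation learning, foundation model | |
| 5 | ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control | ProjFlow: projection sampling based on flow matching for zero-shot exact spatial motion control | flow matching, human motion | |
| 6 | SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation | Proposes SpectralMamba-UNet, which uses frequency-domain disentangled modeling to achieve texture-structure consistent medical image segmentation | Mamba, state space model | |
| 7 | MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding | Proposes MSJoE, which jointly optimizes the MLLM and the sampler for efficient long-video understanding | reinforcement learning, large language model, multimodal | |
| 8 | GeoWorld: Geometric World Models | GeoWorld: improves multi-step visual planning performance through a hyperbolic geometric world model | reinforcement learning, world model | |
| 9 | SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling | SPMamba-YOLO: an underwater object detection network that combines multi-scale feature enhancement with global context modeling | Mamba, state space model | |
| 10 | ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation | Proposes ManifoldGD, a training-free, manifold-guided diffusion method for dataset distillation | distillation | |
| 11 | WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning | Proposes WARM-CAT, which addresses distribution shift in compositional zero-shot learning through test-time knowledge accumulation | representation learning, multimodal | |
| 12 | UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models | UCM: a world model that unifies camera control and memory via time-aware positional encoding warping | world model | |
| 13 | SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation | SPATIALALIGN: a self-improvement framework that improves the alignment of dynamic spatial relationships in text-to-video generation | DPO, direct preference optimization, spatial relationship | |
| 14 | Few-Shot Continual Learning for 3D Brain MRI with Frozen Foundation Models | Proposes a few-shot continual learning method that combines a frozen foundation model with LoRA modules | MAE, foundation model | |
| 15 | WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning | WARM-CAT: warm-started test-time comprehensive knowledge accumulation for compositional zero-shot learning | representation learning, multimodal | |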

🔬 Pillar 3: Perception & Semantics (14 papers)

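| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views | GIFSplat: generative prior-guided, iterative feed-forward 3D Gaussian splatting for reconstruction from sparse views | 3D gaussian splatting, gaussian splatting, splatting | |
| 17 | Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking | Proposes LaGS for camera-based 4D panoptic occupancy tracking in dynamic environments | gaussian splatting, splatting, scene understanding | |
| 18 | GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation | GSTurb: uses Gaussian splatting to mitigate image degradation caused by atmospheric turbulence | gaussian splatting, splatting, optical flow | |
| 19 | Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes | Proposes a monocular open-vocabulary occupancy prediction method for indoor scenes, improving understanding of complex environments | splatting, open-vocabulary, open vocabulary | |
| 20 | Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation? | Proposes a retrieval-augmented test-time adapter that bridges the supervision gap in open-vocabulary segmentation with a few examples | open-vocabulary, open vocabulary | |
| 21 | Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning | Pix2Key: a controllable open-vocabulary image retrieval method based on semantic decomposition and self-supervised visual dictionary learning | open-vocabulary, open vocabulary | |
| 22 | BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model | Proposes BetterScene to address novel view synthesis from sparse photos | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 23 | Instruction-based Image Editing with Planning, Reasoning, and Generation | Proposes an instruction-driven image editing method based on planning, reasoning, and generation, improving editing quality in complex scenes | scene understanding, large language model, chain-of-thought | |
| 24 | SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction | SwiftNDC: fast neural depth correction for high-fidelity 3D reconstruction | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 25 | Motion-aware Event Suppression for Event Cameras | Proposes a motion-aware event suppression framework that filters, in real time, events caused by independently moving objects and ego-motion in event cameras | visual odometry, IMoS | |
| 26 | PackUV: Packed Gaussian UV Maps for 4D Volumetric Video | PackUV: a compact Gaussian representation based on UV maps for efficient storage and transmission of 4D volumetric video | gaussian splatting, splatting | |
| 27 | FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time | FLIGHT: Fibonacci lattice-based real-time inference of geometric heading from monocular video | visual odometry | |
| 28 | Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings | Proposes a framework that reconstructs canoe velocity and stroke rate from panned and zoomed video, with no on-board sensors required | optical flow | |
| 29 | Motion-aware Event Suppression for Event Cameras | Proposes a motion-aware event suppression framework that filters, in real time, events generated by independently moving objects and ego-motion in event cameras | visual odometry, IMoS | |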

🔬 Pillar 9: Embodied Foundation Models (14 papers)

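| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 30 | Large Multimodal Models as General In-Context Classifiers | Proposes the CIRCLE method, which improves the in-context learning ability of large models in open-world classification | multimodal | |
| 31 | AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios | AgentVista: an evaluation benchmark for multimodal agents in ultra-challenging realistic visual scenarios | multimodal | |
| 32 | Efficient Encoder-Free Fourier-based 3D Large Multimodal Model | Proposes Fase3D, an efficient encoder-free Fourier-based 3D large model for processing large-scale point cloud scenes | multimodal | |
| 33 | MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis | Proposes MM-NeuroOnco, a multimodal benchmark and instruction dataset for MRI-based brain tumor diagnosis, to promote clinically interpretable diagnostic reasoning | multimodal | |
| 34 | Towards Multimodal Domain Generalization with Few Labels | Proposes a semi-supervised multimodal domain generalization framework for cross-domain multimodal learning with few labels | multimodal | |
| 35 | Asymmetric Idiosyncrasies in Multimodal Models | Studies asymmetric idiosyncrasies in multimodal models, revealing a loss of style information in text-to-image generation | multimodal | |
| 36 | Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models | Proposes a multimodal weight allocation module that makes multimodal image understanding models more robust to missing modalities | multimodal | |
| 37 | Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy | Proposes a weakly supervised vision-language modeling method for cytoarchitecture analysis of human brain microscopy images | large language model, foundation model | |
| 38 | Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study | Proposes a multi-agent framework based on contrastive adjudication for distinguishing visually hard-to-separate diseases | large language model, multimodal | |
| 39 | SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs | Proposes SoPE, a spherical coordinate-based positional embedding that enhances the spatial perception of 3D LVLMs | large language model, multimodal | |
| 40 | ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding | ThinkOmni: lifts textual reasoning ability to omni-modal scenarios via guidance decoding | large language model | |
| 41 | SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling | SubspaceAD: a training-free few-shot anomaly detection method based on subspace modeling | foundation model | |
| 42 | HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models | Proposes HulluEdit to address hallucinations in large vision-language models | visual grounding | |
| 43 | Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing | Proposes a crowdsourcing-based method for constructing audio-visual quality assessment datasets and releases the YT-NTU-AVQ dataset | multimodal | |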

🔬 Pillar 1: Robot Control (4 papers)

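| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 44 | EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents | EmbodMocap: a portable 4D human-scene reconstruction method using two iPhones, for embodied agents | humanoid, humanoid robot, sim-to-real | |
| 45 | Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving | Proposes Risk-aware World Model Predictive Control (RaWMPC) to address generalization in end-to-end autonomous driving | model predictive control, imitation learning, world model | |
| 46 | ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals | ArtPro: self-supervised articulated object reconstruction with adaptive integration of mobility proposals | manipulation, 3D gaussian splatting, gaussian splatting | |
| 47 | OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality | OSDaR-AR: improves the quality of railway perception datasets through multi-modal augmented reality | sim-to-real | |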

🔬 Pillar 4: Generative Motion (3 papers)

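| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 48 | Causal Motion Diffusion Models for Autoregressive Motion Generation | Proposes the Causal Motion Diffusion Model (CMDM) to address instability and latency in autoregressive motion generation | motion diffusion model, motion diffusion, text-to-motion | |
| 49 | DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation | DyaDiT: a multi-modal diffusion transformer for generating socially appropriate dyadic conversational gestures | motion generation, human motion | |
| 50 | DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI | Proposes DisQ-HNet for synthesizing Tau-PET from MRI and interpreting modality contributions | VQ-VAE, multimodal | |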

🔬 Pillar 8: Physics-based Animation (2 papers)

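| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 51 | Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents | GUIPruner: spatio-temporal token pruning for high-resolution GUI agents, improving efficiency while maintaining performance | spatiotemporal | |
| 52 | Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks | Proposes a tensor-based multidimensional task learning framework that unifies classification, segmentation, and detection in computer vision | spatiotemporal | |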
