cs.CV (2026-02-02)

📊 55 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (21 🔗4) · Pillar 9: Embodied Foundation Models (14 🔗2) · Pillar 3: Perception & Semantics (11 🔗3) · Pillar 1: Robot Control (7 🔗1) · Pillar 4: Generative Motion (1) · Pillar 5: Interaction & Reaction (1)

🔬 Pillar 2: RL & Architecture (21 papers)

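| # | Title | One-line summary | Tags |
|---|---|---|---|
| 1 | Toward Cognitive Supersensing in Multimodal Large Language Model | Proposes a Cognitive Supersensing training paradigm that improves multimodal large language models on complex cognitive tasks. | reinforcement learning, open-vocabulary, open vocabulary |
| 2 | UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving | UniDriveDreamer: a single-stage multimodal world model for autonomous driving. | world model, dreamer, multimodal |
| 3 | ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning | Proposes ClueTracer, which suppresses hallucinations in multimodal reasoning without any training. | Eureka, multimodal, visual grounding |
| 4 | DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation | Proposes the DenVisCoM Mamba block and a hybrid architecture for efficient, real-time optical flow and stereo matching estimation. | Mamba, optical flow |
| 5 | VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations | Proposes VQ-Style, a framework based on residual quantized representations that disentangles style and content in human motion data. | contrastive learning, VQ-VAE, human motion |
| 6 | Unified Personalized Reward Model for Vision Generation | Proposes UnifiedReward-Flex to improve the performance of personalized reward models in visual generation. | reinforcement learning, DPO, direct preference optimization |
| 7 | Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation | Causal Forcing: high-quality, real-time interactive video generation via autoregressive diffusion distillation. | distillation, instruction following |
| 8 | One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation | Proposes the OSMF framework, which aligns the click preferences of different user groups in large-scale advertising image generation. | DPO, large language model, multimodal |
| 9 | Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning | Proposes CaCoVID, which uses reinforcement learning for contribution-aware token compression to make video understanding more efficient. | reinforcement learning, large language model |
| 10 | Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation | Proposes DiScene to address the efficiency and accuracy of indoor occupancy prediction. | distillation, feature matching |
| 11 | Teacher-Guided Student Self-Knowledge Distillation Using Diffusion Model | Proposes DSKD, a diffusion-model-based, teacher-guided student self-knowledge distillation method that addresses the mismatch between teacher and student feature distributions. | teacher-student, distillation |
| 12 | SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking | Proposes SMTrack, which uses a state-aware Mamba model for efficient temporal modeling in visual tracking. | Mamba, state space model |
| 13 | Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages | Proposes the TAFS GRPO framework, which speeds up aligning flow matching models with human preferences and improves few-step text-to-image generation quality. | reinforcement learning, flow matching |
| 14 | HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation | Proposes HandMCM, which uses multimodal point clouds and Correspondence Mamba to handle occlusion in 3D hand pose estimation. | Mamba, state space model |
| 15 | Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory | Infinite-World: scales interactive world models to 1000 frames via pose-free hierarchical memory. | world model |
| 16 | LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization | LongVPO: optimizes long-video preferences through self-reasoning, without long-video annotations. | direct preference optimization, large language model |
| 17 | GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation | Proposes the Guided Progressive Distillation (GPD) framework to accelerate diffusion models for high-quality video generation. | distillation |
| 18 | Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention | Proposes TempCache, AnnCA, and AnnSA to speed up inference of autoregressive video diffusion models and reduce GPU memory usage. | world model |
| 19 | Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks | Proposes a unified design specification for world models to overcome the task-level fragmentation of existing methods. | world model |
| 20 | Samba+: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework | Proposes Samba+, a general Mamba-based salient object detection framework applicable to multiple SOD tasks. | Mamba |
| 21 | Rotation-free Online Handwritten Character Recognition Using Linear Recurrent Units | Proposes a rotation-free online handwritten character recognition framework based on SW-PS and LRU, improving robustness to rotation. | SSM, state space model |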

🔬 Pillar 9: Embodied Foundation Models (14 papers)

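| # | Title | One-line summary | Tags |
|---|---|---|---|
| 22 | Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies | VIA-Bench: a benchmark of visual illusions and anomalies that exposes the perceptual fragility of multimodal large language models. | large language model, multimodal, chain-of-thought |
| 23 | Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models | Proposes the VDR-Bench benchmark to evaluate multimodal large language models on visual and textual search. | large language model, multimodal |
| 24 | Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model | For multimodal large language models, proposes Q Cache to reduce visual token redundancy and KV cache usage, improving inference efficiency. | large language model, multimodal |
| 25 | ObjEmbed: Towards Universal Multimodal Object Embeddings | ObjEmbed: universal multimodal object embeddings for fine-grained vision-language alignment. | multimodal, visual grounding |
| 26 | SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection | SPIRIT: adapts vision foundation models for unified single-frame and multi-frame infrared small target detection. | foundation model |
| 27 | Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models | Uses visual foundation models with a simple linear classifier for general AI-generated image detection, significantly improving generalization. | foundation model |
| 28 | UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception | Proposes the UV-M3TL framework for multimodal multi-task learning in assistive driving perception, improving performance and mitigating negative transfer between tasks. | multimodal |
| 29 | Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd | Proposes the Multimodal UNcommonsense benchmark and uses the R-ICL framework to improve models' commonsense reasoning in anomalous scenarios. | multimodal |
| 30 | Rethinking Genomic Modeling Through Optical Character Recognition | Proposes OpticalDNA to address information waste in genomic modeling. | large language model, foundation model |
| 31 | FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding | FreshMem: a brain-inspired frequency-space hybrid memory network for streaming video understanding. | large language model, multimodal |
| 32 | ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval | Proposes the ReCALL framework to address capability degradation when MLLMs are used for composed image retrieval. | large language model, multimodal |
| 33 | Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation? | Omni-Judge: explores the potential of omni-modal LLMs as human-aligned evaluators for text-conditioned audio-video generation. | large language model, chain-of-thought |
| 34 | SelvaMask: Segmenting Trees in Tropical Forests and Beyond | SelvaMask: a new dataset and detection-segmentation framework for tree segmentation in tropical forests. | foundation model |
| 35 | LoopViT: Scaling Visual ARC with Looped Transformers | LoopViT: uses looped Transformers to improve generalization on visual ARC problems. | chain-of-thought |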

🔬 Pillar 3: Perception & Semantics (11 papers)

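| # | Title | One-line summary | Tags |
|---|---|---|---|
| 36 | Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images | Proposes RS-MPOD, which uses multimodal prompting to improve open-vocabulary generalization for object detection in remote sensing images. | open-vocabulary, open vocabulary, multimodal |
| 37 | MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models | MAIN-VLA: models abstractions of intention and environment to improve VLA models' decision-making in complex environments. | affordance, vision-language-action, VLA |
| 38 | UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction | UrbanGS: a 3D reconstruction framework for city-scale scenes that balances geometric accuracy, efficiency, and scalability. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 39 | SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors | SurfSplat: uses surface continuity priors for feedforward 2D Gaussian splatting, improving 3D reconstruction quality from sparse images. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 40 | LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation | Proposes LangMap, a hierarchical benchmark for open-vocabulary goal navigation. | open-vocabulary, open vocabulary |
| 41 | FastPhysGS: Accelerating Physics-based Dynamic 3DGS Simulation via Interior Completion and Adaptive Optimization | Proposes FastPhysGS to accelerate physics-based dynamic 3DGS simulation. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 42 | VRGaussianAvatar: Integrating 3D Gaussian Avatars into VR | VRGaussianAvatar: integrates 3D Gaussian avatars into VR for real-time full-body virtual avatars. | 3D gaussian splatting, 3DGS, gaussian splatting |
| 43 | Real-Time Loop Closure Detection in Visual SLAM via NetVLAD and Faiss | Uses NetVLAD and Faiss to accelerate real-time loop closure detection in visual SLAM. | visual SLAM |
| 44 | CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions | Proposes CloDS for visual-only, unsupervised cloth dynamics learning under unknown conditions. | gaussian splatting, splatting |
| 45 | Real-Time 2D LiDAR Object Detection Using Three-Frame RGB Scan Encoding | Proposes a real-time 2D LiDAR object detection method based on three-frame RGB scan encoding, suited to indoor service robots. | occupancy grid |
| 46 | Tail-Aware Post-Training Quantization for 3D Geometry Models | Proposes TAPTQ to address quantization of 3D geometry models. | VGGT |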

🔬 Pillar 1: Robot Control (7 papers)

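| # | Title | One-line summary | Tags |
|---|---|---|---|
| 47 | CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization | Proposes the CIEC framework, which uses weak supervision to localize multimodal image-text manipulation. | manipulation, multimodal |
| 48 | Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection | Assesses the role of deepfake detectors in multimodal misinformation detection: semantic understanding and external evidence are essential. | manipulation, multimodal |
| 49 | DDP-WM: Disentangled Dynamics Prediction for Efficient World Models | DDP-WM: an efficient world model with disentangled dynamics prediction that speeds up autonomous robot planning. | manipulation, MPC, world model |
| 50 | How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing | VIBE: a systematic benchmark for visual instruction-driven image editing. | manipulation, multimodal, instruction following |
| 51 | ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding | ProxyImg: highly controllable image representation via hierarchical disentangled proxy embedding. | manipulation, implicit representation, physically plausible |
| 52 | MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos | MLV-Edit: a consistent and efficient editing framework for minute-level videos. | manipulation |
| 53 | FlowBypass: Rectified Flow Trajectory Bypass for Training-Free Image Editing | Proposes FlowBypass, which performs training-free image editing via a rectified flow trajectory bypass, improving fidelity and alignment. | manipulation |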

🔬 Pillar 4: Generative Motion (1 paper)

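| # | Title | One-line summary | Tags |
|---|---|---|---|
| 54 | Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation | Superman: unifies skeleton and visual information for human motion perception and generation. | motion generation, motion tokenizer, SMPL |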

🔬 Pillar 5: Interaction & Reaction (1 paper)

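| # | Title | One-line summary | Tags |
|---|---|---|---|
| 55 | Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars | InteractAvatar: proposes a dual-stream framework for text-driven talking avatar generation with embodied human-object interaction. | human-object interaction, human motion |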
