cs.CV (2026-02-02)

📊 55 papers in total | 🔗 10 with code

🎯 Interest Area Navigation

Pillar 2: RL Algorithms and Architecture (21 🔗4) · Pillar 9: Embodied Foundation Models (14 🔗2) · Pillar 3: Spatial Perception and Semantics (11 🔗3) · Pillar 1: Robot Control (7 🔗1) · Pillar 4: Generative Motion (1) · Pillar 5: Interaction and Reaction (1)

🔬 Pillar 2: RL Algorithms and Architecture (21 papers)

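#	Title	One-sentence summary	Tags	🔗
1	Toward Cognitive Supersensing in Multimodal Large Language Model	Proposes a Cognitive Supersensing training paradigm that improves multimodal large language models on complex cognitive tasks.	reinforcement learning open-vocabulary open vocabulary
2	UniDriveDreamer: A Single-Stage Multimodal World Model for Autonomous Driving	UniDriveDreamer: a single-stage multimodal world model for autonomous driving.	world model dreamer multimodal
3	ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning	Proposes ClueTracer, which suppresses hallucinations in multimodal reasoning without any training.	Eureka multimodal visual grounding
4	DenVisCoM: Dense Vision Correspondence Mamba for Efficient and Real-time Optical Flow and Stereo Estimation	Proposes the DenVisCoM Mamba block and a hybrid architecture for efficient, real-time optical flow and stereo matching estimation.	Mamba optical flow	✅
5	VQ-Style: Disentangling Style and Content in Motion with Residual Quantized Representations	Proposes VQ-Style, a framework based on residual quantized representations that disentangles style and content in human motion data.	contrastive learning VQ-VAE human motion
6	Unified Personalized Reward Model for Vision Generation	Proposes UnifiedReward-Flex to improve the performance of personalized reward models in vision generation.	reinforcement learning DPO direct preference optimization
7	Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation	Causal Forcing: high-quality, real-time interactive video generation via autoregressive diffusion distillation.	distillation instruction following	✅
8	One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation	Proposes the OSMF framework, which aligns the click preferences of different user groups in large-scale advertising image generation.	DPO large language model multimodal	✅
9	Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning	Proposes CaCoVID, which uses reinforcement learning for contribution-aware token compression to make video understanding more efficient.	reinforcement learning large language model
10	Enhancing Indoor Occupancy Prediction via Sparse Query-Based Multi-Level Consistent Knowledge Distillation	Proposes DiScene to address the efficiency and accuracy of indoor occupancy prediction.	distillation feature matching	✅
11	Teacher-Guided Student Self-Knowledge Distillation Using Diffusion Model	Proposes DSKD, a teacher-guided student self-knowledge distillation method based on a diffusion model, to address the mismatch between teacher and student feature distributions.	teacher-student distillation
12	SMTrack: State-Aware Mamba for Efficient Temporal Modeling in Visual Tracking	Proposes SMTrack, which uses a state-aware Mamba model for efficient temporal modeling in visual tracking.	Mamba state space model
13	Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages	Proposes the TAFS GRPO framework, which speeds up the alignment of flow matching models with human preferences and improves few-step text-to-image generation quality.	reinforcement learning flow matching
14	HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation	Proposes HandMCM, which uses multimodal point clouds and Correspondence Mamba to handle occlusion in 3D hand pose estimation.	Mamba state space model
15	Infinite-World: Scaling Interactive World Models to 1000-Frame Horizons via Pose-Free Hierarchical Memory	Infinite-World: scales interactive world models to 1000 frames via pose-free hierarchical memory.	world model
16	LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization	LongVPO: optimizes long-form video preferences through self-reasoning, without long-video annotations.	direct preference optimization large language model
17	GPD: Guided Progressive Distillation for Fast and High-Quality Video Generation	Proposes the Guided Progressive Distillation (GPD) framework to accelerate diffusion models for high-quality video generation.	distillation
18	Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention	Proposes TempCache, AnnCA, and AnnSA to speed up inference of autoregressive video diffusion models and reduce GPU memory usage.	world model
19	Research on World Models Is Not Merely Injecting World Knowledge into Specific Tasks	Proposes a unified design specification for world models to overcome the task-level fragmentation of existing methods.	world model
20	Samba+: General and Accurate Salient Object Detection via A More Unified Mamba-based Framework	Proposes Samba+, a general Mamba-based salient object detection framework applicable to multiple SOD tasks.	Mamba
21	Rotation-free Online Handwritten Character Recognition Using Linear Recurrent Units	Proposes a rotation-free online handwritten character recognition framework based on SW-PS and LRU that improves robustness to rotation.	SSM state space model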

🔬 Pillar 9: Embodied Foundation Models (14 papers)

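#	Title	One-sentence summary	Tags	🔗
22	Seeing Is Believing? A Benchmark for Multimodal Large Language Models on Visual Illusions and Anomalies	VIA-Bench: a benchmark of visual illusions and anomalies that reveals the perceptual fragility of multimodal large language models.	large language model multimodal chain-of-thought
23	Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models	Proposes the VDR-Bench benchmark to evaluate multimodal large language models on visual and textual search.	large language model multimodal	✅
24	Q Cache: Visual Attention is Valuable in Less than Half of Decode Layers for Multimodal Large Language Model	Proposes Q Cache for multimodal large language models to reduce visual token redundancy and KV cache usage, improving inference efficiency.	large language model multimodal
25	ObjEmbed: Towards Universal Multimodal Object Embeddings	ObjEmbed: universal multimodal object embeddings that enable fine-grained vision-language alignment.	multimodal visual grounding
26	SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection	SPIRIT: adapts vision foundation models for unified single-frame and multi-frame infrared small target detection.	foundation model
27	Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models	Uses visual foundation models with a simple linear classifier for general AI-generated image detection, markedly improving generalization.	foundation model
28	UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception	Proposes the UV-M3TL framework for multimodal multi-task learning in assistive driving perception, improving performance and mitigating negative transfer between tasks.	multimodal
29	Multimodal UNcommonsense: From Odd to Ordinary and Ordinary to Odd	Proposes the Multimodal UNcommonsense benchmark and uses the R-ICL framework to improve commonsense reasoning in anomalous scenarios.	multimodal
30	Rethinking Genomic Modeling Through Optical Character Recognition	Proposes OpticalDNA to address wasted information in genomic modeling.	large language model foundation model
31	FreshMem: Brain-Inspired Frequency-Space Hybrid Memory for Streaming Video Understanding	FreshMem: a brain-inspired frequency-space hybrid memory network for streaming video understanding.	large language model multimodal
32	ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval	Proposes the ReCALL framework to address capability degradation when MLLMs are used for composed image retrieval.	large language model multimodal
33	Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?	Omni-Judge: explores the potential of omni-modal LLMs as human-aligned evaluators for text-conditioned audio-video generation.	large language model chain-of-thought
34	SelvaMask: Segmenting Trees in Tropical Forests and Beyond	SelvaMask: a new dataset and a detection-segmentation framework for tree segmentation in tropical forests.	foundation model
35	LoopViT: Scaling Visual ARC with Looped Transformers	LoopViT: uses looped Transformers to improve generalization on visual ARC problems.	chain-of-thought	✅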

🔬 Pillar 3: Spatial Perception and Semantics (11 papers)

#	Title	One-sentence summary	Tags	🔗
36	Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images	Proposes RS-MPOD, which uses multimodal prompting to improve open-vocabulary generalization for object detection in remote sensing images.	open-vocabulary open vocabulary multimodal
37	MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models	MAIN-VLA: models abstractions of intention and environment to improve the decision-making of VLA models in complex environments.	affordance vision-language-action VLA
38	UrbanGS: A Scalable and Efficient Architecture for Geometrically Accurate Large-Scene Reconstruction	UrbanGS: a 3D reconstruction framework for city-scale scenes that balances geometric accuracy, efficiency, and scalability.	3D gaussian splatting 3DGS gaussian splatting
39	SurfSplat: Conquering Feedforward 2D Gaussian Splatting with Surface Continuity Priors	SurfSplat: feedforward 2D Gaussian splatting with surface continuity priors, improving 3D reconstruction quality from sparse images.	3D gaussian splatting 3DGS gaussian splatting	✅
40	LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation	Proposes LangMap, a hierarchical benchmark for open-vocabulary goal navigation.	open-vocabulary open vocabulary	✅
41	FastPhysGS: Accelerating Physics-based Dynamic 3DGS Simulation via Interior Completion and Adaptive Optimization	Proposes FastPhysGS to accelerate physics-based dynamic 3DGS simulation.	3D gaussian splatting 3DGS gaussian splatting
42	VRGaussianAvatar: Integrating 3D Gaussian Avatars into VR	VRGaussianAvatar: integrates 3D Gaussian avatars into VR for real-time full-body virtual avatars.	3D gaussian splatting 3DGS gaussian splatting
43	Real-Time Loop Closure Detection in Visual SLAM via NetVLAD and Faiss	Uses NetVLAD and Faiss to accelerate real-time loop closure detection in visual SLAM.	visual SLAM
44	CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions	Proposes CloDS for visual-only, unsupervised learning of cloth dynamics under unknown conditions.	gaussian splatting splatting	✅
45	Real-Time 2D LiDAR Object Detection Using Three-Frame RGB Scan Encoding	Proposes a real-time 2D LiDAR object detection method based on three-frame RGB scan encoding, suited to indoor service robots.	occupancy grid
46	Tail-Aware Post-Training Quantization for 3D Geometry Models	Proposes TAPTQ to address the quantization of 3D geometry models.	VGGT

🔬 Pillar 1: Robot Control (7 papers)

#	Title	One-sentence summary	Tags	🔗
47	CIEC: Coupling Implicit and Explicit Cues for Multimodal Weakly Supervised Manipulation Localization	Proposes the CIEC framework, which uses weak supervision to localize multimodal image-text manipulation.	manipulation multimodal
48	Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection	Assesses the role of deepfake detectors in multimodal misinformation detection: semantic understanding and external evidence are critical.	manipulation multimodal
49	DDP-WM: Disentangled Dynamics Prediction for Efficient World Models	DDP-WM: an efficient world model with disentangled dynamics prediction that speeds up autonomous robot planning.	manipulation MPC world model	✅
50	How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing	VIBE: a systematic benchmark for visual instruction-driven image editing.	manipulation multimodal instruction following
51	ProxyImg: Towards Highly-Controllable Image Representation via Hierarchical Disentangled Proxy Embedding	ProxyImg: highly controllable image representation via hierarchical disentangled proxy embedding.	manipulation implicit representation physically plausible
52	MLV-Edit: Towards Consistent and Highly Efficient Editing for Minute-Level Videos	MLV-Edit: a consistent and efficient editing framework for minute-level videos.	manipulation
53	FlowBypass: Rectified Flow Trajectory Bypass for Training-Free Image Editing	Proposes FlowBypass, a rectified flow trajectory bypass for training-free image editing that improves fidelity and alignment.	manipulation

🔬 Pillar 4: Generative Motion (1 paper)

#	Title	One-sentence summary	Tags	🔗
54	Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation	Superman: unifies skeleton and visual information for human motion perception and generation.	motion generation motion tokenizer SMPL

🔬 Pillar 5: Interaction and Reaction (1 paper)

#	Title	One-sentence summary	Tags	🔗
55	Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars	InteractAvatar: proposes a dual-stream framework for text-driven talking avatar generation with embodied human-object interaction.	human-object interaction human motion
