cs.CV (2026-01-22)

📊 24 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (9 🔗1) · Pillar 9: Embodied Foundation Models (5) · Pillar 3: Perception & Semantics (4) · Pillar 8: Physics-based Animation (2 🔗1) · Pillar 7: Motion Retargeting (1) · Pillar 5: Interaction & Reaction (1) · Pillar 6: Video Extraction (1) · Pillar 1: Robot Control (1)

🔬 Pillar 2: RL & Architecture (9 papers)

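# | Title | One-line summary | Tags | 🔗
1 | Keyframe-Based Feed-Forward Visual Odometry | Proposes a reinforcement-learning-based keyframe feed-forward visual odometry that improves efficiency and accuracy. | reinforcement learning, visual odometry, VGGT
2 | Understanding the Transfer Limits of Vision Foundation Models | Studies the limits of transfer learning with vision foundation models, stressing the importance of aligning pretraining objectives with downstream tasks. | MAE, contrastive learning, foundation model
3 | PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models | PhysicsMind: a comprehensive benchmark for evaluating the physical reasoning abilities of VLMs and world models. | world model, large language model, multimodal
4 | Evolving Without Ending: Unifying Multimodal Incremental Learning for Continual Panoptic Perception | Proposes the CPP model, which unifies multimodal incremental learning to achieve continual panoptic perception. | distillation, multimodal
5 | Explainable Deepfake Detection with RL Enhanced Self-Blended Images | Proposes an explainable deepfake detection method based on RL-enhanced self-blended images. | reinforcement learning, large language model, multimodal
6 | DSFedMed: Dual-Scale Federated Medical Image Segmentation via Mutual Distillation Between Foundation and Lightweight Models | DSFedMed: dual-scale federated medical image segmentation via mutual distillation between foundation and lightweight models. | distillation, foundation model
7 | Clustering-Guided Spatial-Spectral Mamba for Hyperspectral Image Classification | Proposes CSSMamba, a clustering-guided spatial-spectral Mamba network for hyperspectral image classification. | Mamba, HSI
8 | HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval | Proposes the HVD model, which improves text-video retrieval by mimicking human visual cognition. | representation learning
9 | NeuroMamba: Multi-Perspective Feature Interaction with Visual Mamba for Neuron Segmentation | NeuroMamba: multi-perspective feature interaction with visual Mamba for neuron segmentation. | Mamba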

🔬 Pillar 9: Embodied Foundation Models (5 papers)

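# | Title | One-line summary | Tags | 🔗
10 | Opening the Black Box: Preliminary Insights into Affective Modeling in Multimodal Foundation Models | Proposes a systematic study to reveal how multimodal foundation models model affect. | foundation model, multimodal
11 | Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs | Proposes the BVS framework, which uses semantic-agnostic inputs to break the restrictions on harmful image generation in multimodal large language models. | large language model, multimodal
12 | Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing | Proposes EDIR, a fine-grained composed image retrieval benchmark built from image editing. | multimodal
13 | VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning | VideoThinker: builds agentic video large language models with LLM-guided tool reasoning. | large language model
14 | Zero-Shot Product Attribute Labeling with Vision-Language Models: A Three-Tier Evaluation Framework | Proposes a three-tier evaluation framework for zero-shot product attribute labeling with vision-language models. | multimodal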

🔬 Pillar 3: Perception & Semantics (4 papers)

# | Title | One-line summary | Tags | 🔗
15 | ThermoSplat: Cross-Modal 3D Gaussian Splatting with Feature Modulation and Geometry Decoupling | ThermoSplat: cross-modal 3D Gaussian splatting reconstruction based on feature modulation and geometry decoupling. | 3D gaussian splatting, 3DGS, gaussian splatting
16 | EVolSplat4D: Efficient Volume-based Gaussian Splatting for 4D Urban Scene Synthesis | EVolSplat4D: an efficient voxelized Gaussian splatting method for 4D urban scene synthesis. | 3D gaussian splatting, gaussian splatting, splatting
17 | LL-GaussianImage: Efficient Image Representation for Zero-shot Low-Light Enhancement with 2D Gaussian Splatting | Proposes LL-GaussianImage for zero-shot low-light enhancement within the 2D Gaussian splatting compressed domain. | gaussian splatting, splatting
18 | LL-GaussianMap: Zero-shot Low-Light Image Enhancement via 2D Gaussian Splatting Guided Gain Maps | Proposes LL-GaussianMap, which uses gain maps guided by 2D Gaussian splatting for zero-shot low-light image enhancement. | gaussian splatting, splatting

🔬 Pillar 8: Physics-based Animation (2 papers)

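# | Title | One-line summary | Tags | 🔗
19 | Region-aware Spatiotemporal Modeling with Collaborative Domain Generalization for Cross-Subject EEG Emotion Recognition | Proposes a framework combining region-aware spatiotemporal modeling with collaborative domain generalization for cross-subject EEG emotion recognition. | spatiotemporal
20 | PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation | PyraTok: a language-aligned pyramidal tokenizer for video understanding and generation. | spatiotemporal, zero-shot transfer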

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line summary | Tags | 🔗
21 | Masked Modeling for Human Motion Recovery Under Occlusions | Proposes MoRo, an occlusion-robust human motion recovery framework based on masked modeling. | human motion

🔬 Pillar 5: Interaction & Reaction (1 paper)

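# | Title | One-line summary | Tags | 🔗
22 | Skywork UniPic 3.0: Unified Multi-Image Composition via Sequence Modeling | Skywork UniPic 3.0: proposes a unified multi-image composition framework based on sequence modeling, achieving high-quality image fusion. | human-object interaction, HOI, multimodal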

🔬 Pillar 6: Video Extraction (1 paper)

# | Title | One-line summary | Tags | 🔗
23 | Event-VStream: Event-Driven Real-Time Understanding for Long Video Streams | Event-VStream: an event-driven framework for real-time understanding of long video streams. | Ego4D, large language model, multimodal

🔬 Pillar 1: Robot Control (1 paper)

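# | Title | One-line summary | Tags | 🔗
24 | DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models | Proposes the DTP framework, which prunes distracting tokens to raise the success rate of vision-language-action models on robot manipulation tasks. | manipulation, VLA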
