cs.CV (2026-02-12)

📊 32 papers in total | 🔗 8 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗2) · Pillar 2: RL & Architecture (8 🔗4) · Pillar 3: Perception & Semantics (4) · Pillar 1: Robot Control (4 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 6: Video Extraction (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

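| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 1 | UniT: Unified Multimodal Chain-of-Thought Test-time Scaling | Proposes UniT, which improves the reasoning ability of unified models through multimodal chain-of-thought test-time scaling. | multimodal, chain-of-thought | |
| 2 | Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding | Proposes object-aligned visual contrastive decoding to mitigate object hallucinations in multimodal large language models. | large language model, multimodal | |
| 3 | Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation | Proposes the Spatial Chain-of-Thought (SCoT) framework to improve diffusion models on spatial reasoning generation tasks. | large language model, multimodal, chain-of-thought | |
| 4 | ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning | Proposes ScalSelect to address the efficiency of large-scale multimodal data selection. | multimodal | |
| 5 | A Large Language Model for Disaster Structural Reconnaissance Summarization | Proposes an LLM-based framework for rapid post-disaster structural reconnaissance summarization, improving post-disaster reconstruction efficiency. | large language model | |
| 6 | Adapting Vision-Language Models for E-commerce Understanding at Scale | Proposes an effective large-scale method for adapting vision-language models to e-commerce scenarios. | multimodal, instruction following | |
| 7 | LLM-Driven 3D Scene Generation of Agricultural Simulation Environments | Proposes an LLM-based modular pipeline for generating 3D scenes of agricultural simulation environments. | large language model | |
| 8 | DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation | DreamID-Omni: a unified, controllable framework for human-centric audio-video generation. | foundation model | |
| 9 | U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction | Proposes TD-FusionUNet, which uses transform-domain fusion for lightweight next-day wildfire spread prediction. | multimodal | |
| 10 | Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis | Proposes VasoMIM, a vascular anatomy-aware self-supervised pre-training framework that improves X-ray angiogram analysis. | foundation model | |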

🔬 Pillar 2: RL & Architecture (8 papers)

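| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 11 | Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception | Proposes region-to-image distillation to improve multimodal large models on fine-grained perception tasks. | distillation, large language model, multimodal | |
| 12 | Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching | UniDFlow: a unified discrete flow matching framework for multimodal reasoning and generation. | flow matching, multimodal | |
| 13 | DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing | DeepGen 1.0: a lightweight unified multimodal model that improves image generation and editing. | reinforcement learning, multimodal | |
| 14 | FAIL: Flow Matching Adversarial Imitation Learning for Image Generation | Proposes Flow Matching Adversarial Imitation Learning (FAIL) for image generation, requiring no explicit rewards or pairwise comparisons. | imitation learning, flow matching | |
| 15 | WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains | WorldTree: a tree-chain-based framework for reconstructing 4D dynamic worlds from monocular video. | SAC, motion representation, spatiotemporal | |
| 16 | RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval | Proposes RI-Mamba to address large-scale 3D shape retrieval under arbitrary orientations. | Mamba, contrastive learning | |
| 17 | PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback | PosterOmni: generalized artistic poster creation via task distillation and unified reward feedback. | distillation | |
| 18 | STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning | Proposes STVG-R1, which uses reinforcement learning to incentivize instance-level reasoning and grounding in videos, addressing hallucination in vision-language models. | reinforcement learning | |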

🔬 Pillar 3: Perception & Semantics (4 papers)

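| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 19 | What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation | Proposes ImagineAgent, which strengthens open-vocabulary human-object interaction comprehension through generative imagination. | open-vocabulary, open vocabulary, human-object interaction | |
| 20 | GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry | GSO-SLAM: a real-time dense SLAM system that bidirectionally couples Gaussian splatting with direct visual odometry. | visual odometry, gaussian splatting, splatting | |
| 21 | TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction | Proposes TG-Field to address artifacts in CT reconstruction under sparse views and dynamic motion. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 22 | Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data | Proposes a lightweight RGB-D segmentation framework with depth-aware fusion, improving Segment Anything models when training data is limited. | monocular depth | |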

🔬 Pillar 1: Robot Control (4 papers)

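| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 23 | GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning | GigaBrain-0.5M*: a VLA model trained with world model-based reinforcement learning, improving performance on long-horizon manipulation tasks. | manipulation, reinforcement learning, world model | |
| 24 | JEPA-VLA: Video Predictive Embedding is Needed for VLA Models | JEPA-VLA: uses video predictive embeddings to strengthen the robotic manipulation ability of vision-language-action models. | manipulation, contrastive learning, vision-language-action | |
| 25 | Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes | Clutt3R-Seg: sparse-view 3D instance segmentation for language-grounded grasping in cluttered scenes. | manipulation, open-vocabulary, open vocabulary | |
| 26 | Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching | Proposes the Stroke of Surprise framework for generating progressive semantic illusions in vector sketches. | manipulation, distillation | |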

🔬 Pillar 4: Generative Motion (2 papers)

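| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 27 | DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target | DynaHOI: a new hand-object interaction benchmark and online evaluation platform for interaction with dynamic targets. | motion generation, HOI, spatiotemporal | |
| 28 | LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts | LUVE: an ultra-high-resolution video generation framework based on dual frequency experts and latent cascading. | motion generation | |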

🔬 Pillar 7: Motion Retargeting (2 papers)

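| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 29 | EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation | EmoSpace: immersive affective content generation via fine-grained emotion prototype learning. | motion representation | |
| 30 | TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation | TexSpot: enhances 3D textures using a spatially uniform point latent representation. | geometric consistency | |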

🔬 Pillar 6: Video Extraction (1 paper)

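| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 31 | Egocentric Gaze Estimation via Neck-Mounted Camera | Proposes the task of gaze estimation from a neck-mounted camera, builds a dataset, and evaluates Transformer models. | egocentric | |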

🔬 Pillar 8: Physics-based Animation (1 paper)

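| # | Title | One-sentence summary | Tags | 🔗 |
|---|-------|----------------------|------|----|
| 32 | MonarchRT: Efficient Attention for Real-Time Video Generation | MonarchRT: an efficient attention mechanism for real-time video generation. | spatiotemporal | |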
