cs.CV (2026-02-04)

📊 34 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (11 🔗3) · Pillar 2: RL & Architecture (8 🔗3) · Pillar 1: Robot Control (5 🔗2) · Pillar 4: Generative Motion (2 🔗1) · Pillar 3: Perception & Semantics (2 🔗2) · Pillar 6: Video Extraction (2) · Pillar 8: Physics-based Animation (2) · Pillar 5: Interaction & Reaction (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (11 papers)

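# | Title | One-line summary | Tags | 🔗
1 | Vision-aligned Latent Reasoning for Multi-modal Large Language Model | Proposes Vision-aligned Latent Reasoning (VaLR) to improve multimodal large language models on complex reasoning tasks. | large language model, chain-of-thought
2 | KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing | Proposes KVSmooth to mitigate hallucination in multimodal large language models. | large language model, multimodal
3 | ImmuVis: Hyperconvolutional Foundation Model for Imaging Mass Cytometry | ImmuVis: a hyperconvolutional foundation model for imaging mass cytometry that handles a non-fixed channel space. | foundation model
4 | OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis | OmniRad: a radiological foundation model for multi-task medical image analysis. | foundation model
5 | Med-MMFL: A Multimodal Federated Learning Benchmark in Healthcare | Med-MMFL: a multimodal federated learning benchmark for healthcare that promotes fair algorithm evaluation and reproducible research. | multimodal
6 | Self-evolving Embodied AI | Proposes self-evolving embodied AI to address the limited generalization of existing embodied AI in dynamic, open environments. | embodied AI
7 | AGMA: Adaptive Gaussian Mixture Anchors for Prior-Guided Multimodal Human Trajectory Forecasting | Proposes AGMA, adaptive Gaussian mixture anchors that improve prior-guided multimodal pedestrian trajectory forecasting. | multimodal
8 | JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models | JSynFlow: a Japanese flowchart visual question answering dataset built with large language models. | large language model
9 | SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation | Proposes SAR-RAG, which enhances automatic target recognition on SAR imagery through semantic search and MLLM generation. | large language model, multimodal
10 | S-MUSt3R: Sliding Multi-view 3D Reconstruction | S-MUSt3R: sliding multi-view 3D reconstruction that extends monocular 3D reconstruction foundation models to large-scale RGB streams. | foundation model
11 | VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? | VISTA-Bench: evaluates how well vision-language models understand visualized text in images. | multimodal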

🔬 Pillar 2: RL & Architecture (8 papers)

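# | Title | One-line summary | Tags | 🔗
12 | Nix and Fix: Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models | Proposes NiFi, which uses diffusion models to compress 3D Gaussian Splatting by 1000x. | distillation, 3D gaussian splatting, 3DGS
13 | Decoupled Hierarchical Distillation for Multimodal Emotion Recognition | Proposes DHMD, a decoupled hierarchical distillation framework that improves multimodal emotion recognition. | distillation, multimodal
14 | Understanding Degradation with Vision Language Model | Proposes DU-VLM for understanding image degradation and achieving high-quality image restoration. | reinforcement learning, multimodal, chain-of-thought
15 | Seg-ReSearch: Segmentation with Interleaved Reasoning and External Search | Proposes Seg-ReSearch, which addresses the knowledge bottleneck in language-guided segmentation through interleaved reasoning and external search. | reward design, large language model, multimodal
16 | Partial Ring Scan: Revisiting Scan Order in Vision State Space Models | Proposes PRISMamba, which improves the rotation robustness and efficiency of Vision SSMs through ring scanning and channel filtering. | Mamba, SSM, state space model
17 | Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion | Proposes an interactive spatial-frequency fusion Mamba network for multi-modal image fusion that improves information complementarity. | Mamba
18 | Multiview Self-Representation Learning across Heterogeneous Views | Proposes a multiview self-representation learning method for unsupervised representation learning across heterogeneous views. | representation learning
19 | Annotation Free Spacecraft Detection and Segmentation using Vision Language Models | Proposes an annotation-free spacecraft detection and segmentation method based on vision-language models. | teacher-student distillation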

🔬 Pillar 1: Robot Control (5 papers)

# | Title | One-line summary | Tags | 🔗
20 | Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models | Uses vision-language-action models to enable scene-responsive, human-in-the-loop motion planning for autonomous driving. | motion planning, vision-language-action, instruction following
21 | AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation | AGILE: reconstructs hand-object interaction from video via agentic generation. | manipulation, dexterous manipulation, contact-aware
22 | CoWTracker: Tracking by Warping instead of Correlation | CoWTracker: a warping-based dense point tracking method that avoids cost-volume computation. | manipulation, optical flow, spatiotemporal
23 | SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking | SynthVerse: a large-scale, diverse synthetic dataset for point tracking. | manipulation, foundation model
24 | When and Where to Attack? Stage-wise Attention-Guided Adversarial Attack on Large Vision Language Models | Proposes SAGA, a stage-wise attention-guided adversarial attack method for vision-language models. | manipulation, multimodal

🔬 Pillar 4: Generative Motion (2 papers)

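# | Title | One-line summary | Tags | 🔗
25 | DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding | DiMo: a discrete diffusion model for motion generation and understanding that unifies bidirectional text-motion tasks. | text-to-motion, motion generation, motion prediction
26 | Laminating Representation Autoencoders for Efficient Diffusion | Proposes FlatDINO, which efficiently compresses DINOv2 features for diffusion models through laminated representation autoencoders. | classifier-free guidance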

🔬 Pillar 3: Perception & Semantics (2 papers)

# | Title | One-line summary | Tags | 🔗
27 | VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image | VecSet-Edit: mesh editing from a single image using a pre-trained LRM. | 3D gaussian splatting, gaussian splatting, splatting
28 | JOintGS: Joint Optimization of Cameras, Bodies and 3D Gaussians for In-the-Wild Monocular Reconstruction | JOintGS: jointly optimizes cameras, bodies, and 3D Gaussians for in-the-wild monocular reconstruction. | 3DGS, splatting, scene reconstruction

🔬 Pillar 6: Video Extraction (2 papers)

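# | Title | One-line summary | Tags | 🔗
29 | Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery | Proposes a depth-guided, metric-aware temporal consistency framework for human mesh recovery from monocular video. | human mesh recovery
30 | Temporal Slowness in Central Vision Drives Semantic Object Learning | Exploits temporal slowness in central vision to improve the semantic object representations learned through self-supervision. | egocentric, Ego4D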

🔬 Pillar 8: Physics-based Animation (2 papers)

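# | Title | One-line summary | Tags | 🔗
31 | Adaptive 1D Video Diffusion Autoencoder | Proposes One-DVA, an adaptive 1D video diffusion autoencoder for video compression and generation. | spatiotemporal
32 | HoloEv-Net: Efficient Event-based Action Recognition via Holographic Spatial Embedding and Global Spectral Gating | HoloEv-Net: efficient event-based action recognition via holographic spatial embedding and global spectral gating. | spatiotemporal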

🔬 Pillar 5: Interaction & Reaction (1 paper)

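# | Title | One-line summary | Tags | 🔗
33 | A labeled dataset of simulated phlebotomy procedures for medical AI: polygon annotations for object detection and human-object interaction | Builds a simulated phlebotomy dataset for research on object detection and human-object interaction in medical AI. | human-object interaction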

🔬 Pillar 7: Motion Retargeting (1 paper)

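# | Title | One-line summary | Tags | 🔗
34 | TrajVG: 3D Trajectory-Coupled Visual Geometry Learning | TrajVG: a trajectory-coupled visual geometry learning framework that improves multi-frame 3D reconstruction on videos with motion. | geometric consistency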
