cs.CV (2026-01-20)

📊 33 papers in total | 🔗 11 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (17 🔗6)
Pillar 3: Perception & Semantics (7 🔗2)
Pillar 2: RL & Architecture (6)
Pillar 8: Physics-based Animation (1 🔗1)
Pillar 7: Motion Retargeting (1 🔗1)
Pillar 4: Generative Motion (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (17 papers)

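# | Title | One-line summary | Tags | 🔗
1 | FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation | Proposes FantasyVLN for unified multimodal chain-of-thought reasoning in vision-language navigation, improving efficiency and performance. | VLA, VLN, multimodal |
2 | Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model | Proposes MM-OOD to address OOD detection in the image space. | large language model, multimodal |
3 | LLM Augmented Intervenable Multimodal Adaptor for Post-operative Complication Prediction in Lung Cancer Surgery | MIRACLE: an intervenable, LLM-augmented multimodal adaptor that fuses clinical and imaging data to predict post-operative complications in lung cancer surgery. | large language model, multimodal |
4 | Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology | Proposes LoCo-RFT to address logical inconsistency in multimodal reasoning for meteorology, and builds the Weather-R1 model. | multimodal |
5 | Scaling Test-time Inference for Visual Grounding | Proposes EGM, which scales test-time compute to improve the performance and efficiency of small visual grounding models. | visual grounding |
6 | The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning | Proposes MIR-SafetyBench, revealing the safety risks of large language models with enhanced multi-image reasoning ability. | large language model, multimodal |
7 | Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration | Proposes the CVSI model, which achieves fine-grained zero-shot composed image retrieval through complementary visual-semantic integration. | large language model, multimodal |
8 | XD-MAP: Cross-Modal Domain Adaptation using Semantic Parametric Mapping | Proposes XD-MAP, which uses semantic parametric mapping for image-to-LiDAR cross-modal domain adaptation. | foundation model |
9 | OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer | OmniTransfer: a unified framework for spatio-temporal video transfer that improves the flexibility and fidelity of video generation. | multimodal |
10 | OCCAM: Class-Agnostic, Training-Free, Prior-Free and Multi-Class Object Counting | Proposes OCCAM, a training-free, prior-free, class-agnostic method for multi-class object counting. | foundation model |
11 | Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders | Insight: builds interpretable semantic hierarchies in vision-language encoders. | foundation model |
12 | HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection | Proposes History-Injection Transformers (HiT) for continuous flood change detection onboard satellites, enabling real-time disaster assessment. | foundation model |
13 | Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search | Proposes the HAVEN framework, which achieves hierarchical long video understanding through audiovisual entity cohesion and agentic search. | multimodal |
14 | Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles | Uses visual puzzles to probe the reasoning ability of large vision-language models, revealing their pattern-matching limitations. | multimodal |
15 | VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement | Proposes VIAFormer for high-fidelity voxel refinement guided by multi-view images. | foundation model |
16 | Face-Voice Association with Inductive Bias for Maximum Class Separation | Proposes a face-voice association method based on an inductive bias for maximum class separation. | multimodal |
17 | ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch | ChartVerse: scales chart reasoning through reliable programmatic synthesis from scratch. | chain-of-thought |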

🔬 Pillar 3: Perception & Semantics (7 papers)

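# | Title | One-line summary | Tags | 🔗
18 | Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation | Proposes the FiCoP framework, which improves the robustness of open-vocabulary 6D object pose estimation through cross-perspective perception and fine-grained matching. | open-vocabulary, open vocabulary |
19 | Rig-Aware 3D Reconstruction of Vehicle Undercarriages using Gaussian Splatting | Proposes a Gaussian splatting-based method for 3D reconstruction of vehicle undercarriages, aimed at improving inspection efficiency and buyer confidence. | gaussian splatting, splatting |
20 | OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3 | Proposes the OmniOVCD framework, which uses SAM for open-vocabulary change detection and achieves SOTA performance. | open-vocabulary, open vocabulary |
21 | One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion | Proposes One-Shot Refiner, which improves the quality of feed-forward novel view synthesis through one-step diffusion. | 3D gaussian splatting, 3DGS, gaussian splatting |
22 | ParkingTwin: Training-Free Streaming 3D Reconstruction for Parking-Lot Digital Twins | ParkingTwin: training-free online 3D reconstruction for parking-lot digital twins. | 3D gaussian splatting, 3DGS, gaussian splatting |
23 | Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation | Proposes a vision-based natural language scene understanding framework for generating traffic scene descriptions in autonomous driving. | scene understanding |
24 | Two-Stream temporal transformer for video action classification | Proposes a two-stream temporal transformer for video action classification that makes better use of spatio-temporal information. | optical flow |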

🔬 Pillar 2: RL & Architecture (6 papers)

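# | Title | One-line summary | Tags | 🔗
25 | Revisiting Multi-Task Visual Representation Learning | Proposes MTV, a multi-task visual pre-training framework that combines the strengths of vision-language and self-supervised learning to improve spatial reasoning. | representation learning, MAE, visual pre-training |
26 | PAS-Mamba: Phase-Amplitude-Spatial State Space Model for MRI Reconstruction | Proposes the PAS-Mamba model, which decouples phase and amplitude information to improve MRI reconstruction quality. | Mamba, state space model |
27 | Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning | Proposes the Glance-or-Gaze framework, which uses reinforcement learning to adaptively focus search, improving LMM performance on knowledge-intensive visual question answering. | reinforcement learning, multimodal |
28 | DIS2: Disentanglement Meets Distillation with Classwise Attention for Robust Remote Sensing Segmentation under Missing Modalities | Proposes the DIS2 framework, which uses disentanglement and distillation to improve the robustness of remote sensing image segmentation under missing modalities. | distillation, multimodal |
29 | Gaussian Based Adaptive Multi-Modal 3D Semantic Occupancy Prediction | Proposes a Gaussian-based adaptive multimodal 3D semantic occupancy prediction method to improve autonomous driving safety. | Mamba, state space model, multimodal |
30 | ASBA: A-line State Space Model and B-line Attention for Sparse Optical Doppler Tomography Reconstruction | Proposes the ASBA network, which uses an A-line state space model and B-line attention to reconstruct sparse optical Doppler tomography. | state space model |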

🔬 Pillar 8: Physics-based Animation (1 paper)

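# | Title | One-line summary | Tags | 🔗
31 | Facial Spatiotemporal Graphs: Leveraging the 3D Facial Surface for Remote Physiological Measurement | Proposes STGraph, a facial spatiotemporal graph that leverages the 3D facial surface for remote physiological measurement. | spatiotemporal |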

🔬 Pillar 7: Motion Retargeting (1 paper)

# | Title | One-line summary | Tags | 🔗
32 | Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing | Interp3D: proposes a correspondence-aware interpolation method for generating textured 3D morphing. | geometric consistency |

🔬 Pillar 4: Generative Motion (1 paper)

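# | Title | One-line summary | Tags | 🔗
33 | Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis | Motion 3-to-4: a framework for high-quality synthesis of 4D dynamic objects driven by monocular video. | motion latent |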
