cs.CV (2026-01-09)

📊 21 papers in total | 🔗 4 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (8 🔗3) · Pillar 2: RL & Architecture (7 🔗1) · Pillar 3: Perception & Semantics (3) · Pillar 1: Robot Control (3)

🔬 Pillar 9: Embodied Foundation Models (8 papers)

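# | Title | One-line summary | Tags | 🔗
1 | Towards Generalized Multi-Image Editing for Unified Multimodal Models | Proposes a scalable multi-image editing framework for unified multimodal models that improves cross-image consistency and generalization. | multimodal
2 | What's Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews | Proposes OMGuard, which improves multimodal news understanding by interpreting and correcting misleading omissions in news previews. | multimodal
3 | One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection | UniADet: a universal, language-free foundation model for visual anomaly detection. | foundation model
4 | Enabling Stroke-Level Structural Analysis of Hieroglyphic Scripts without Language-Specific Priors | Proposes HieroSA, which performs stroke-level structural analysis of hieroglyphic scripts without language-specific priors. | large language model, multimodal
5 | Orient Anything V2: Unifying Orientation and Rotation Understanding | Orient Anything V2: a foundation model that unifies understanding of object 3D orientation and rotation. | foundation model
6 | VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck | Proposes VIB-Probe, which detects and mitigates hallucinations in vision-language models via a variational information bottleneck. | multimodal
7 | MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding | Proposes MMViR, a multi-modal, multi-granularity representation for long videos that improves long-video understanding. | large language model
8 | ROAP: A Reading-Order and Attention-Prior Pipeline for Optimizing Layout Transformers in Key Information Extraction | Proposes the ROAP pipeline, which optimizes layout Transformers with reading-order and attention priors to improve key information extraction. | multimodal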

🔬 Pillar 2: RL & Architecture (7 papers)

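# | Title | One-line summary | Tags | 🔗
9 | SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes | SceneAlign: aligns multimodal reasoning to scene graphs to improve reasoning faithfulness in complex visual scenes. | direct preference optimization, large language model, multimodal
10 | LayerGS: Decomposition and Inpainting of Layered 3D Human Avatars via 2D Gaussian Splatting | Proposes LayerGS, which decomposes and inpaints layered 3D human avatars via 2D Gaussian splatting for high-quality virtual try-on. | distillation, gaussian splatting, splatting
11 | LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction | LatentVLA: an efficient vision-language model for autonomous driving based on self-supervised latent action prediction. | distillation, vision-language-action, VLA
12 | SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More | Proposes SketchVL, which optimizes the policy through fine-grained credit assignment to improve chart understanding. | reinforcement learning, large language model, multimodal
13 | Boosting Latent Diffusion Models via Disentangled Representation Alignment | Proposes Send-VAE, which improves the generation quality and training efficiency of latent diffusion models through disentangled representation alignment. | representation learning, classifier-free guidance, foundation model
14 | Adaptive Disentangled Representation Learning for Incomplete Multi-View Multi-Label Classification | Proposes an adaptive disentangled representation learning (ADRL) method for incomplete multi-view multi-label classification. | representation learning
15 | Compressing image encoders via latent distillation | Proposes a latent-space distillation method for compressing image encoders, suited to resource-constrained settings. | distillation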

🔬 Pillar 3: Perception & Semantics (3 papers)

# | Title | One-line summary | Tags | 🔗
16 | FeatureSLAM: Feature-enriched 3D gaussian splatting SLAM in real time | FeatureSLAM: a real-time, feature-enriched 3D Gaussian splatting SLAM system. | 3D gaussian splatting, 3DGS, gaussian splatting
17 | GS-DMSR: Dynamic Sensitive Multi-scale Manifold Enhancement for Accelerated High-Quality 3D Gaussian Splatting | GS-DMSR: dynamic sensitive multi-scale manifold enhancement for accelerating high-quality 3D Gaussian splatting. | 3D gaussian splatting, gaussian splatting, splatting
18 | GeoSurDepth: Spatial Geometry-Consistent Self-Supervised Depth Estimation for Surround-View Cameras | GeoSurDepth: spatial geometry-consistent self-supervised depth estimation for surround-view cameras. | depth estimation, scene understanding, foundation model

🔬 Pillar 1: Robot Control (3 papers)

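# | Title | One-line summary | Tags | 🔗
19 | GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting | GaussianSwap: an animatable video face-swapping framework based on 3D Gaussian splatting. | manipulation, 3D gaussian splatting, gaussian splatting
20 | SceneFoundry: Generating Interactive Infinite 3D Worlds | SceneFoundry: proposes a language-guided diffusion framework for generating interactive, infinite 3D scenes. | manipulation, embodied AI
21 | Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals | Goal Force: proposes a force-vector-based video generation model for goal-directed control under physical conditions. | manipulation, world model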
