cs.CV (2026-01-21)

📊 25 papers in total | 🔗 3 with code

🎯 Navigation by Interest Area

Pillar 9: Embodied Foundation Models (10 🔗3) · Pillar 3: Perception & Semantics (5) · Pillar 2: RL & Architecture (5) · Pillar 1: Robot Control (3) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

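| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | Multimodal system for skin cancer detection | Proposes a multimodal skin cancer detection system based on ordinary photographs and metadata, improving diagnostic accessibility. | multimodal | |
| 2 | Iterative Refinement Improves Compositional Image Generation | Proposes an iterative refinement framework that uses vision-language model feedback to improve the quality of compositional image generation. | large language model, chain-of-thought | |
| 3 | Towards Understanding Best Practices for Quantization of Vision-Language Models | Studies best practices for quantizing vision-language models to improve efficiency on multimodal tasks. | large language model, multimodal | |
| 4 | LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding | Proposes LiViBench, an omnimodal benchmark for interactive livestream video understanding. | large language model, multimodal | |
| 5 | FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes | FunCineForge: a unified toolkit and model for zero-shot movie dubbing across diverse cinematic scenes. | multimodal, instruction following | |
| 6 | HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding | Proposes HERMES, which uses a hierarchical KV cache for efficient streaming video understanding. | large language model, multimodal | |
| 7 | 3D Space as a Scratchpad for Editable Text-to-Image Generation | Proposes an editable text-to-image generation framework that uses 3D space as a scratchpad, improving spatial reasoning. | large language model, chain-of-thought | |
| 8 | Rethinking Video Generation Model for the Embodied World | For embodied AI, proposes RBench, a benchmark for evaluating robot video generation, and RoVid-X, a large-scale dataset. | embodied AI | |
| 9 | Training-Free and Interpretable Hateful Video Detection via Multi-stage Adversarial Reasoning | Proposes MARS, a training-free and interpretable multi-stage adversarial reasoning framework for detecting hateful videos. | multimodal | |
| 10 | Symmetry Informative and Agnostic Feature Disentanglement for 3D Shapes | Proposes a method for disentangling symmetry-informative and symmetry-agnostic features, improving 3D shape analysis performance. | foundation model | |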

🔬 Pillar 3: Perception & Semantics (5 papers)

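| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 11 | POTR: Post-Training 3DGS Compression | POTR: a post-training compression method for 3D Gaussian splatting that significantly speeds up inference and reduces storage requirements. | 3D gaussian splatting, 3DGS, gaussian splatting | |
| 12 | GAT-NeRF: Geometry-Aware-Transformer Enhanced Neural Radiance Fields for High-Fidelity 4D Facial Avatars | Proposes GAT-NeRF, which enhances NeRF with a geometry-aware Transformer to reconstruct high-fidelity 4D facial avatars. | NeRF, neural radiance field | |
| 13 | SpatialMem: Unified 3D Memory with Metric Anchoring and Fast Retrieval | SpatialMem: a unified 3D memory system with metric anchoring and fast retrieval. | open-vocabulary, open vocabulary, egocentric | |
| 14 | RayRoPE: Projective Ray Positional Encoding for Multi-view Attention | RayRoPE: a projective ray positional encoding for multi-view attention that improves novel view synthesis and depth estimation. | depth estimation, stereo depth | |
| 15 | ScenDi: 3D-to-2D Scene Diffusion Cascades for Urban Generation | ScenDi: an urban scene generation method that combines 3D and 2D diffusion models. | 3DGS | |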

🔬 Pillar 2: RL & Architecture (5 papers)

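| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 16 | ReinPath: A Multimodal Reinforcement Learning Approach for Pathology | Proposes ReinPath, a multimodal reinforcement learning approach for pathology analysis. | reinforcement learning, large language model, multimodal | |
| 17 | Deep Leakage with Generative Flow Matching Denoiser | Proposes a deep leakage attack based on a generative flow matching denoiser, making privacy attacks on federated learning more effective. | flow matching, foundation model | |
| 18 | UBATrack: Spatio-Temporal State Space Model for General Multi-Modal Tracking | UBATrack: a general multi-modal object tracking framework based on a spatio-temporal state space model. | Mamba, state space model | |
| 19 | FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion | FlowSSC: universal generative monocular semantic scene completion based on one-step latent diffusion. | flow matching, spatial relationship | |
| 20 | M2I2HA: A Multi-modal Object Detection Method Based on Intra- and Inter-Modal Hypergraph Attention | Proposes M2I2HA, which uses hypergraph attention for multi-modal object detection, improving detection accuracy in complex environments. | Mamba, SSM, state space model | |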

🔬 Pillar 1: Robot Control (3 papers)

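| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 21 | Walk through Paintings: Egocentric World Models from Internet Priors | Proposes EgoWM, which uses priors from internet video to build a controllable egocentric world model. | humanoid, manipulation, world model | |
| 22 | DrivIng: A Large-Scale Multimodal Driving Dataset with Full Digital Twin Integration | DrivIng: a large-scale multimodal autonomous driving dataset with full digital twin integration. | sim-to-real, multimodal | |
| 23 | LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes | LuxRemix: an interactive lighting editing method that decomposes and remixes the lighting of indoor scenes. | manipulation, 3D gaussian splatting, gaussian splatting | |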

🔬 Pillar 4: Generative Motion (1 paper)

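| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 24 | Reconstruction-Anchored Diffusion Model for Text-to-Motion Generation | Proposes a reconstruction-anchored diffusion model to address missing information in text-to-motion generation. | motion diffusion model, motion diffusion, text-to-motion | |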

🔬 Pillar 8: Physics-based Animation (1 paper)

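| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 25 | Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement | Proposes a fuzzy-control-based adaptive video inference enhancement framework to resolve the accuracy-resource dilemma. | spatiotemporal | |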
