cs.CV (2026-01-14)

📊 27 papers in total | 🔗 2 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (10 🔗2) · Pillar 3: Perception & Semantics (7) · Pillar 2: RL & Architecture (4) · Pillar 1: Robot Control (2) · Pillar 8: Physics-based Animation (2) · Pillar 5: Interaction & Reaction (1) · Pillar 7: Motion Retargeting (1)

🔬 Pillar 9: Embodied Foundation Models (10 papers)

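# | Title | One-line summary | Tags | 🔗
1 | Vision Foundation Models for Domain Generalisable Cross-View Localisation in Planetary Ground-Aerial Robotic Teams | Proposes a cross-view localization method based on vision foundation models for collaboration between planetary ground and aerial robots. | foundation model |
2 | LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models | Proposes LP-LLM, which uses large multimodal models to recognize degraded license plates in real-world scenes end to end. | multimodal |
3 | CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems | CogRail: builds a benchmark for cognitive intrusion perception in railways and proposes a joint fine-tuning framework that improves VLM performance. | foundation model, multimodal |
4 | Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs | Video-MSR: the first benchmark for evaluating multi-step spatial reasoning in dynamic video. | large language model, multimodal |
5 | See More, Store Less: Memory-Efficient Resolution for Video Moment Retrieval | Proposes the SMORE framework to address memory efficiency in video moment retrieval. | large language model, multimodal |
6 | Multi-Modal LLM based Image Captioning in ICT: Bridging the Gap Between General and Industry Domain | Proposes an ICT-domain image captioning model trained in multiple progressive stages, improving its grasp of domain knowledge. | large language model, multimodal |
7 | Hot-Start from Pixels: Low-Resolution Visual Tokens for Chinese Language Modeling | Proposes a Chinese language modeling method based on low-resolution pixels that makes effective use of the visual information in Chinese characters. | large language model |
8 | Beyond the final layer: Attentive multilayer fusion for vision transformers | Proposes an attention-based multilayer fusion method that improves the linear probing performance of Vision Transformers. | foundation model |
9 | PhyRPR: Training-Free Physics-Constrained Video Generation | Proposes PhyRPR for video generation under physical constraints. | multimodal |
10 | Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning | Proposes Slow4fast-VLN, which achieves general vision-language navigation through fast-slow interactive reasoning. | VLN |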

🔬 Pillar 3: Perception & Semantics (7 papers)

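# | Title | One-line summary | Tags | 🔗
11 | OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding | Proposes OpenVoxel, a training-free algorithm that groups and captions voxels of 3D scenes for open-vocabulary scene understanding. | scene understanding, open-vocabulary, open vocabulary |
12 | GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials | GaussianFluent: proposes a Gaussian-based framework for simulating and rendering fracture in dynamic scenes with mixed materials. | 3D gaussian splatting, 3DGS, gaussian splatting |
13 | SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion | SpikeVAEDiff: combines VD-VAE and a diffusion model to reconstruct natural visual scenes from neural spike data. | scene reconstruction |
14 | Affostruction: 3D Affordance Grounding with Generative Reconstruction | Affostruction: proposes a 3D affordance grounding method based on generative reconstruction, addressing the limitation to visible surfaces. | affordance |
15 | A$^2$TG: Adaptive Anisotropic Textured Gaussians for Efficient 3D Scene Representation | Proposes A$^2$TG, an adaptive anisotropic textured Gaussian representation that improves 3D scene rendering efficiency. | gaussian splatting, splatting |
16 | SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings | SCE-SLAM: achieves scale-consistent monocular SLAM through scene coordinate embeddings. | visual SLAM |
17 | V-DPM: 4D Video Reconstruction with Dynamic Point Maps | V-DPM: performs 4D video reconstruction with dynamic point maps, without post-processing optimization. | VGGT |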

🔬 Pillar 2: RL & Architecture (4 papers)

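# | Title | One-line summary | Tags | 🔗
18 | GRCF: Two-Stage Groupwise Ranking and Calibration Framework for Multimodal Sentiment Analysis | Proposes the GRCF framework, which uses groupwise ranking and calibration to address label noise and ranking bias in multimodal sentiment analysis. | MAE, HuMoR, multimodal |
19 | STEP3-VL-10B Technical Report | STEP3-VL-10B: a lightweight multimodal foundation model that achieves excellent performance through efficient pre-training and reinforcement learning. | reinforcement learning, foundation model, multimodal |
20 | LPCAN: Lightweight Pyramid Cross-Attention Network for Rail Surface Defect Detection Using RGB-D Data | Proposes LPCAN to address high complexity in rail surface defect detection. | MAE, multimodal |
21 | CLIDD: Cross-Layer Independent Deformable Description for Efficient and Discriminative Local Feature Representation | Proposes CLIDD, which achieves efficient and discriminative local feature representation through cross-layer independent deformable description, suited to spatial intelligence tasks such as robot navigation. | distillation, feature matching |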

🔬 Pillar 1: Robot Control (2 papers)

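# | Title | One-line summary | Tags | 🔗
22 | Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning | Fast-ThinkAct: achieves efficient vision-language-action reasoning through interpretable latent planning. | manipulation, policy learning, vision-language-action |
23 | Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets | Proposes MANGO, which uses image translation to obtain viewpoint-robust robot policies from fixed-camera datasets. | manipulation, sim2real, imitation learning |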

🔬 Pillar 8: Physics-based Animation (2 papers)

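# | Title | One-line summary | Tags | 🔗
24 | Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking | STDTrack: explores reliable spatiotemporal dependencies to improve lightweight visual tracking performance. | spatiotemporal |
25 | Image2Garment: Simulation-ready Garment Generation from a Single Image | Proposes Image2Garment for generating physically accurate garments from a single image. | differentiable simulation |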

🔬 Pillar 5: Interaction & Reaction (1 paper)

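# | Title | One-line summary | Tags | 🔗
26 | GlovEgo-HOI: Bridging the Synthetic-to-Real Gap for Industrial Egocentric Human-Object Interaction Detection | Proposes the GlovEgo-HOI dataset and the GlovEgo-Net model to address data scarcity for EHOI detection in industrial settings. | human-object interaction, HOI, egocentric |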

🔬 Pillar 7: Motion Retargeting (1 paper)

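# | Title | One-line summary | Tags | 🔗
27 | Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering | SRENDER: uses sparse diffusion and 3D rendering for efficient camera-controlled video generation of static scenes. | geometric consistency, embodied AI |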
