cs.CV (2026-02-09)

📊 35 papers in total | 🔗 7 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (15 🔗4) · Pillar 2: RL & Architecture (9 🔗1) · Pillar 3: Perception & Semantics (6 🔗1) · Pillar 4: Generative Motion (2 🔗1) · Pillar 8: Physics-based Animation (2) · Pillar 1: Robot Control (1)

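🔬 Pillar 9: Embodied Foundation Models (15 papers)

# | Title | One-sentence summary | Tags | 🔗
1 | Chain-of-Caption: Training-free improvement of multimodal large language model on referring expression comprehension | Proposes the Chain-of-Caption framework, which improves multimodal large language models on referring expression comprehension without any training. | large language model, multimodal, chain-of-thought
2 | From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models | Proposes the HATCH framework to make multi-view spatial reasoning in multimodal large language models more human-like. | large language model, multimodal
3 | TiFRe: Text-guided Video Frame Reduction for Efficient Video Multi-modal Large Language Models | Proposes the TiFRe framework, which improves Video MLLM efficiency through text-guided video frame reduction. | large language model
4 | Multimodal Learning for Arcing Detection in Pantograph-Catenary Systems | Proposes MultiDeepSAD, a multimodal learning framework for detecting arcing faults in pantograph-catenary systems. | multimodal
5 | Zero-shot System for Automatic Body Region Detection for Volumetric CT and MR Images | Proposes a zero-shot method that uses pretrained models to automatically detect body regions in CT/MR images. | large language model, foundation model, multimodal
6 | OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence | OneVision-Encoder: codec-aligned sparsity as a foundational principle for multimodal intelligence. | multimodal
7 | GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving | GeoFocus: a multimodal geometry problem-solving framework that blends efficient global-to-local perception. | multimodal
8 | Are Vision Foundation Models Foundational for Electron Microscopy Image Segmentation? | Studies how well vision foundation models apply to electron microscopy image segmentation and reveals difficulties with cross-dataset generalization. | foundation model
9 | A Unified Framework for Multimodal Image Reconstruction and Synthesis using Denoising Diffusion Models | Any2all: a unified framework for multimodal image reconstruction and synthesis based on denoising diffusion models. | multimodal
10 | Vista: Scene-Aware Optimization for Streaming Video Question Answering under Post-Hoc Queries | Proposes the Vista framework to address scene awareness in streaming video question answering. | large language model, multimodal
11 | Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing | Omni-Video 2: scales MLLM-conditioned diffusion models for unified video generation and editing. | multimodal
12 | MOVA: Towards Scalable and Synchronized Video-Audio Generation | MOVA: a model for scalable and synchronized video-audio generation. | multimodal
13 | TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions | Proposes Omni Dense Captioning to generate time-aware descriptions of multi-scene videos. | TAMP
14 | ALIVE: Animate Your World with Lifelike Audio-Video Generation | ALIVE: brings the world to life through lifelike audio-video generation. | foundation model
15 | Improving Reconstruction of Representation Autoencoder | Proposes LV-RAE, which improves the image reconstruction and generation quality of representation autoencoders by enhancing low-level information and optimizing the decoder. | foundation model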

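🔬 Pillar 2: RL & Architecture (9 papers)

# | Title | One-sentence summary | Tags | 🔗
16 | Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation | Proposes the GR-CoT framework, which uses geospatial reasoning to strengthen open-vocabulary recognition in remote sensing semantic segmentation. | distillation, scene understanding, open-vocabulary
17 | Any-to-All MRI Synthesis: A Unified Foundation Model for Nasopharyngeal Carcinoma and Its Downstream Applications | Proposes a unified foundation model for any-to-all MRI modality synthesis in nasopharyngeal carcinoma, improving radiotherapy planning accuracy. | representation learning, VLA, foundation model
18 | UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science | Proposes UGE, which improves multimodal representation learning for urban environments through spatial graph embeddings. | contrastive learning, multimodal
19 | When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning | AVIC: an adaptive test-time scaling framework based on world models that improves the efficiency and reliability of visual spatial reasoning. | world model, large language model, multimodal
20 | VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning | VideoVeritas: detects AI-generated video via perception pretext reinforcement learning. | reinforcement learning, spatiotemporal, large language model
21 | WorldCompass: Reinforcement Learning for Long-Horizon World Models | WorldCompass: reinforcement learning post-training for long-horizon interactive video world models, improving exploration ability. | reinforcement learning, world model
22 | WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models | WorldArena: a unified benchmark for evaluating the perception and functional utility of embodied world models. | world model, embodied AI
23 | Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition | Proposes Demo-ICL, an in-context learning approach to procedural video knowledge acquisition. | direct preference optimization, large language model, multimodal
24 | SemiNFT: Learning to Transfer Presets from Imitation to Appreciation via Hybrid-Sample Reinforcement Learning | SemiNFT: transfers presets from imitation to appreciation via hybrid-sample reinforcement learning. | reinforcement learning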

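🔬 Pillar 3: Perception & Semantics (6 papers)

# | Title | One-sentence summary | Tags | 🔗
25 | Analysis of Converged 3D Gaussian Splatting Solutions: Density Effects and Prediction Limit | Analyzes converged 3D Gaussian Splatting solutions, revealing density effects and the prediction limit to improve training robustness. | 3D gaussian splatting, 3DGS, gaussian splatting
26 | MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE | Proposes MotionCrafter for 4D geometry and motion reconstruction from monocular video. | scene flow, motion reconstruction
27 | Rotated Lights for Consistent and Efficient 2D Gaussians Inverse Rendering | Proposes RotLight, a rotated-light method for 2D Gaussian inverse rendering that improves albedo estimation accuracy. | gaussian splatting, splatting, neural radiance field
28 | Grow with the Flow: 4D Reconstruction of Growing Plants with Gaussian Flow Fields | Proposes a 3D Gaussian flow field model for 4D reconstruction of plant growth. | gaussian splatting, splatting
29 | FLAG-4D: Flow-Guided Local-Global Dual-Deformation Model for 4D Reconstruction | FLAG-4D: a flow-guided local-global dual-deformation model for 4D reconstruction of dynamic scenes. | optical flow
30 | Thegra: Graph-based SLAM for Thermal Imagery | Thegra: a graph-optimization SLAM system for thermal imagery that improves localization accuracy in harsh environments. | visual SLAM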

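🔬 Pillar 4: Generative Motion (2 papers)

# | Title | One-sentence summary | Tags | 🔗
31 | Language-Guided Transformer Tokenizer for Human Motion Generation | Proposes a language-guided Transformer tokenizer for efficient human motion generation. | motion generation, human motion, human motion generation
32 | TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation | Proposes TriC-Motion, a text-to-motion generation framework that fuses spatial, temporal, and frequency-domain information and applies causal intervention. | text-to-motion, motion generation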

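🔬 Pillar 8: Physics-based Animation (2 papers)

# | Title | One-sentence summary | Tags | 🔗
33 | Tighnari v2: Mitigating Label Noise and Distribution Shift in Multimodal Plant Distribution Prediction via Mixture of Experts and Weakly Supervised Learning | Tighnari v2: mitigates label noise and distribution shift in multimodal plant distribution prediction through a mixture of experts and weakly supervised learning. | spatiotemporal, multimodal
34 | MVAnimate: Enhancing Character Animation with Multi-View Optimization | MVAnimate: uses multi-view optimization to improve the quality of character animation generation. | character animation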

🔬 Pillar 1: Robot Control (1 paper)

# | Title | One-sentence summary | Tags | 🔗
35 | SynSacc: A Blender-to-V2E Pipeline for Synthetic Neuromorphic Eye-Movement Data and Sim-to-Real Spiking Model Training | SynSacc: a Blender-to-V2E synthesis pipeline for neuromorphic eye-movement data and SNN model training. | sim-to-real
