cs.CV(2026-02-11)

📊 28 papers in total | 🔗 6 with code

🎯 Interest Area Navigation

Pillar 2: RL & Architecture (10 🔗2) · Pillar 9: Embodied Foundation Models (8 🔗3) · Pillar 1: Robot Control (3) · Pillar 3: Perception & Semantics (3 🔗1) · Pillar 7: Motion Retargeting (2) · Pillar 4: Generative Motion (1) · Pillar 8: Physics-based Animation (1)

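🔬 Pillar 2: RL & Architecture (10 papers)

# | Title | One-sentence summary | Tags | 🔗
1 | A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology | DermFM-Zero: a vision-language foundation model for zero-shot clinical collaboration in dermatology. | contrastive learning, foundation model, multimodal
2 | MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning | Proposes MetaphorStar, which uses end-to-end visual reinforcement learning to tackle image metaphor understanding and reasoning. | reinforcement learning, large language model, multimodal
3 | 3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars | 3DXTalker: unifies identity, lip sync, emotion, and spatial dynamics for expressive 3D talking avatar generation. | flow matching, motion generation
4 | HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images | Proposes HII-DPO, which uses hallucination-inducing counterfactual images to eliminate hallucination in vision-language models. | DPO, multimodal
5 | Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling | Proposes DiNa-LRM, a diffusion-native latent reward model that improves the efficiency of preference optimization for diffusion models. | flow matching, preference learning, multimodal
6 | FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference | FastFlow: accelerates generative flow matching models with bandit inference. | flow matching, distillation
7 | LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation | LaSSM: 3D instance segmentation based on local aggregation and state space models. | SSM, state space model
8 | Spectral-Spatial Contrastive Learning Framework for Regression on Hyperspectral Data | Proposes a spectral-spatial contrastive learning framework for regression on hyperspectral data, improving model performance. | representation learning, contrastive learning
9 | Self-Supervised Image Super-Resolution Quality Assessment based on Content-Free Multi-Model Oriented Representation Learning | Proposes a self-supervised method for image super-resolution quality assessment based on content-free, multi-model oriented representation learning. | representation learning, contrastive learning
10 | Dual-End Consistency Model | Proposes the Dual-End Consistency Model (DE-CM), which addresses unstable training and inflexible sampling in consistency models to enable efficient image generation. | flow matching, distillation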

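🔬 Pillar 9: Embodied Foundation Models (8 papers)

# | Title | One-sentence summary | Tags | 🔗
11 | RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation | RSHallu: hallucination evaluation and domain-tailored mitigation for remote-sensing multimodal large language models. | large language model, multimodal, visual grounding
12 | C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning | Proposes C^2RoPE to address the limitations of RoPE positional encoding in 3D large multimodal model reasoning. | large language model, multimodal
13 | PhyCritic: Multimodal Critic Models for Physical AI | Proposes PhyCritic to improve the evaluation and alignment capabilities of multimodal models on physical AI tasks. | multimodal
14 | Ecological mapping with geospatial foundation models | Applies geospatial foundation models to ecological mapping; TerraMind performs strongly. | foundation model
15 | VideoSTF: Stress-Testing Output Repetition in Video Large Language Models | VideoSTF: a benchmark framework for evaluating output repetition in video large language models. | large language model
16 | Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance | Proposes a text-guided, weakly supervised multimodal video anomaly detection framework that improves the representation of anomaly features. | multimodal
17 | Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting | Proposes the Chain-of-Look spatial reasoning framework for dense surgical instrument counting. | large language model, multimodal
18 | TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning | Proposes TwiFF, a large-scale dataset and model for dynamic visual reasoning. | multimodal, chain-of-thought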

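🔬 Pillar 1: Robot Control (3 papers)

# | Title | One-sentence summary | Tags | 🔗
19 | HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion | HairWeaver: few-shot photorealistic hair motion synthesis based on diffusion models. | sim-to-real, sim2real, motion synthesis
20 | Towards Learning a Generalizable 3D Scene Representation from 2D Observations | Proposes a generalizable neural radiance field method for 3D reconstruction of a robot's global workspace. | humanoid, humanoid robot, manipulation
21 | Chatting with Images for Introspective Visual Thinking | Proposes ViLaVT, which strengthens introspective visual reasoning in vision-language models through interactive dialogue with images. | manipulation, reinforcement learning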

🔬 Pillar 3: Perception & Semantics (3 papers)

# | Title | One-sentence summary | Tags | 🔗
22 | AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models | AugVLA-3D: a vision-language-action model with depth-driven feature augmentation. | depth estimation, VGGT, vision-language-action
23 | PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation | PuriLight: a lightweight shuffle and purification framework for monocular depth estimation. | depth estimation, monocular depth
24 | Interpretable Vision Transformers in Monocular Depth Estimation via SVDA | Proposes an SVDA-based Transformer for monocular depth estimation with an interpretable self-attention mechanism. | depth estimation, monocular depth

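🔬 Pillar 7: Motion Retargeting (2 papers)

# | Title | One-sentence summary | Tags | 🔗
25 | MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps | MapVerse: a large-scale benchmark dataset for evaluating geospatial question answering on real-world maps. | spatial relationship, large language model, multimodal
26 | SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos | SurfPhase: a method for reconstructing the 3D interfacial dynamics of two-phase flows from sparse videos. | geometric consistency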

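🔬 Pillar 4: Generative Motion (1 paper)

# | Title | One-sentence summary | Tags | 🔗
27 | Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation | Proposes the MP-HOI framework, which uses multimodal priors to generate high-quality text-driven 3D human-object interactions. | motion generation, human-object interaction, HOI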

🔬 Pillar 8: Physics-based Animation (1 paper)

# | Title | One-sentence summary | Tags | 🔗
28 | DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories | Proposes DeepImageSearch, which uses an agent paradigm to address context-aware image retrieval in visual histories. | spatiotemporal, multimodal
