cs.CV (2025-12-26)
📊 15 papers in total | 🔗 1 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6)
Pillar 2: RL & Architecture (3)
Pillar 8: Physics-based Animation (2)
Pillar 1: Robot Control (2)
Pillar 4: Generative Motion (1)
Pillar 3: Perception & Semantics (1 🔗1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | iSHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception | iSHIFT: a lightweight slow-fast GUI agent with adaptive perception that improves interaction efficiency and accuracy. | large language model, multimodal, visual grounding | | |
| 2 | See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning | Proposes a bi-directional perceptual shaping method to address insufficient visual evidence in multimodal reasoning. | multimodal | | |
| 3 | Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models | Proposes BadVSFM, a backdoor attack framework targeting prompt-driven video segmentation foundation models. | foundation model | | |
| 4 | Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models | Proposes the Inherent-enhanced Multi-modal Calibration framework, which improves the robustness of medical multimodal large language models under noisy conditions. | large language model | | |
| 5 | SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data Analysis | SLIM-Brain: a data- and training-efficient foundation model for fMRI analysis. | foundation model | | |
| 6 | Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models | Proposes DIOR, a training-free conditional image embedding framework that leverages large vision-language models. | foundation model | | |
🔬 Pillar 2: RL & Architecture (3 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 7 | Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition | Proposes PAN, a human-centric graph representation learning framework for multimodal action recognition. | representation learning, spatiotemporal, multimodal | | |
| 8 | VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning | Proposes VideoZoomer, which uses reinforcement learning to dynamically focus on key frames for long video reasoning. | reinforcement learning, large language model, multimodal | | |
| 9 | Yume-1.5: A Text-Controlled Interactive World Generation Model | Yume-1.5: a text-controlled interactive world generation model with improved real-time performance and controllability. | linear attention, distillation | | |
🔬 Pillar 8: Physics-based Animation (2 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 10 | End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration | Proposes XET-V2X for end-to-end 3D spatiotemporal perception with multimodal fusion in V2X scenarios. | spatiotemporal, multimodal | | |
| 11 | LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration | LongFly: proposes a spatiotemporal context integration framework for long-horizon UAV vision-and-language navigation. | spatiotemporal, VLN, multimodal | | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement | VULCAN: a tool-augmented multi-agent method for iterative 3D object arrangement. | manipulation, scene understanding, large language model | | |
| 13 | Attack-Aware Deepfake Detection under Counter-Forensic Manipulations | Proposes an attack-aware deepfake detector with improved robustness and trustworthiness under counter-forensic manipulations. | manipulation | | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models | DeMoGen: proposes an energy-based diffusion model for decompositional human motion generation. | text-to-motion, motion generation, human motion | | |
🔬 Pillar 3: Perception & Semantics (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer | Proposes Reloc-VGGT, which uses a geometry-grounded Transformer for robust and efficient visual re-localization. | VGGT, spatial relationship | ✅ | |