cs.CV (2025-12-15)
📊 11 papers in total
🎯 Navigation by interest area
Pillar 9: Embodied Foundation Models (5)
Pillar 2: RL & Architecture (2)
Pillar 3: Perception & Semantics (2)
Pillar 4: Generative Motion (1)
Pillar 7: Motion Retargeting (1)
🔬 Pillar 9: Embodied Foundation Models (5 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion | Proposes a unified interactive multimodal moment retrieval system built on cascaded embedding-reranking and temporal-aware score fusion | multimodal | ||
| 2 | Revisiting 2D Foundation Models for Scalable 3D Medical Image Classification | AnyMC3D: scalable 3D medical image classification using 2D pretrained models | foundation model | ||
| 3 | VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference | VLCache: computes 2% of the tokens and reuses the other 98% during vision-language inference to speed it up | multimodal | ||
| 4 | MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation | MADTempo: an interactive multi-event temporal video retrieval system with query augmentation | visual grounding | ||
| 5 | Towards Interactive Intelligence for Digital Humans | Proposes the Mio framework for interactive digital humans with personalized expression, adaptive interaction, and self-evolution | multimodal | ||
🔬 Pillar 2: RL & Architecture (2 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 6 | MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning | MindDrive: a vision-language-action model for autonomous driving based on online reinforcement learning | reinforcement learning, imitation learning, vision-language-action | ||
| 7 | Motus: A Unified Latent Action World Model | Motus: a unified latent action world model that improves embodied-agent performance in simulation and in the real world | world model, optical flow, vision-language-action | ||
🔬 Pillar 3: Perception & Semantics (2 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 8 | Computer vision training dataset generation for robotic environments using Gaussian splatting | Proposes a Gaussian splatting-based method for automatically generating computer vision training datasets for robotic environments | 3D gaussian splatting, 3DGS, gaussian splatting | ||
| 9 | MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion | MMDrive: proposes a multimodal fusion framework for interactive scene understanding that goes beyond the limits of vision | scene understanding, multimodal | ||
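🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 10 | Why Text Prevails: Vision May Undermine Multimodal Medical Decision Making | In multimodal medical decision making, text information outperforms visual information, revealing the insufficient visual understanding of MLLMs | MDM, large language model, multimodal | ||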
🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | Content Adaptive based Motion Alignment Framework for Learned Video Compression | Proposes CAMA, a content-adaptive motion alignment framework that improves the performance of learned video compression | motion estimation | ||