cs.CV(2025-03-25)
📊 23 papers in total | 🔗 10 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (7 🔗3)
Pillar 2: RL Algorithms & Architecture (6 🔗3)
Pillar 3: Spatial Perception & Semantics (5 🔗2)
Pillar 1: Robot Control (4 🔗2)
Pillar 4: Generative Motion (1)
🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | CoLLM: A Large Language Model for Composed Image Retrieval | Proposes CoLLM, which uses large language models to address data scarcity and multimodal fusion challenges in composed image retrieval. | large language model multimodal | | |
| 2 | Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception | Proposes HyperDUM, which uses hyperdimensional computing to efficiently quantify multimodal uncertainty in autonomous-vehicle perception. | multimodal | | |
| 3 | FullDiT: Multi-Task Video Generative Foundation Model with Full Attention | FullDiT: a multi-task video generative foundation model based on full attention. | foundation model | | |
| 4 | PAVE: Patching and Adapting Video Large Language Models | PAVE: enhances the multimodal understanding of video large language models through lightweight adapters. | large language model | ✅ | |
| 5 | Towards Online Multi-Modal Social Interaction Understanding | Proposes the Online-MMSI-VLM framework for online multimodal social interaction understanding, addressing real-time human-machine interaction. | large language model multimodal | ✅ | |
| 6 | Audio-centric Video Understanding Benchmark without Text Shortcut | Proposes AVUT, an audio-centric video understanding benchmark that addresses the text-shortcut problem. | large language model multimodal | ✅ | |
| 7 | RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models | Proposes RGB-Th-Bench for evaluating how well vision-language models understand RGB-Thermal image pairs. | multimodal | | |
🔬 Pillar 2: RL Algorithms & Architecture (6 papers)
🔬 Pillar 3: Spatial Perception & Semantics (5 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations | OpenLex3D: a tiered evaluation benchmark for open-vocabulary 3D scene representations. | scene understanding open-vocabulary open vocabulary | ✅ | |
| 15 | LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation | Proposes LPOSS+, which uses label propagation to improve vision-language models for open-vocabulary semantic segmentation. | open-vocabulary open vocabulary | ✅ | |
| 16 | Show or Tell? Effectively prompting Vision-Language Models for semantic segmentation | Studies how to effectively prompt vision-language models for semantic segmentation and proposes PromptMatcher. | open-vocabulary open vocabulary foundation model | | |
| 17 | The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs | Releases Coralscapes, a coral reef image dataset for semantic understanding of coral reef scenes. | scene understanding | | |
| 18 | Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders | Proposes Vanishing Depth, which augments generalized image encoders with positional depth encoding to enable metric depth understanding. | metric depth | | |
🔬 Pillar 1: Robot Control (4 papers)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 19 | TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through Task Tokenization | TokenHSI: unified synthesis of physical human-scene interactions through task tokenization. | humanoid physically plausible human-scene interaction | ✅ | |
| 20 | OpenSDI: Spotting Diffusion-Generated Images in the Open World | Proposes the OpenSDI dataset and the SPM framework for detecting and localizing diffusion-generated images in the open world. | manipulation masked autoencoder MAE | ✅ | |
| 21 | G-DexGrasp: Generalizable Dexterous Grasping Synthesis Via Part-Aware Prior Retrieval and Prior-Assisted Generation | G-DexGrasp: generalizable dexterous grasp synthesis via part-aware prior retrieval and prior-assisted generation. | dexterous hand affordance | | |
| 22 | PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model | PartRM: models part-level dynamics with a large cross-state reconstruction model. | manipulation world model | | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | Key Point | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 23 | Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models | Learns 3D object spatial relationships from pre-trained 2D diffusion models. | motion synthesis spatial relationship | | |