cs.CV(2025-09-20)
📊 15 papers in total | 🔗 2 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (6 🔗1)
Pillar 3: Perception & Semantics (4)
Pillar 2: RL & Architecture (3 🔗1)
Pillar 1: Robot Control (1)
Pillar 4: Generative Motion (1)
🔬 Pillar 9: Embodied Foundation Models (6 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | KV-Efficient VLA: A Method to Speed up Vision Language Models with RNN-Gated Chunked KV Cache | KV-Efficient VLA: accelerates vision-language models with an RNN-gated chunked KV cache. | vision-language-action VLA | | |
| 2 | MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation | MMPart: uses multimodal large language models for part-aware 3D generation. | large language model | | |
| 3 | AnimalBooth: Multimodal Feature Enhancement for Animal Subject Personalization | AnimalBooth: a multimodal feature-enhancement framework for personalized image generation of animal subjects. | multimodal | | |
| 4 | Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model | Detects and simulates urban heat islands with a fine-tuned geospatial foundation model. | foundation model | | |
| 5 | Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence | Proposes the ActiSeg-NL benchmark to study action-prompted video segmentation under label noise, and introduces the PMHM module. | multimodal | ✅ | |
| 6 | VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analy | Proposes VC-Inspector for factuality-focused, reference-free evaluation of video captions, improving accuracy and interpretability. | multimodal | | |
🔬 Pillar 3: Perception & Semantics (4 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
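| 7 | Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding | Proposes the Text-Scene framework, which automatically parses 3D scenes into natural language to support 3D scene understanding. | scene understanding affordance spatial relationship | | |
| 8 | ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting | Proposes the ST-GS framework, which uses spatial-temporal Gaussian splatting to improve 3D semantic occupancy prediction in vision-centric autonomous driving. | gaussian splatting splatting scene understanding | | |
| 9 | MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging | MedGS: Gaussian-splatting-based reconstruction and interpolation for multi-modal 3D medical imaging. | gaussian splatting splatting | | |
| 10 | SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving | SQS: enhances sparse perception models in autonomous driving through query-based splatting. | splatting | | |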
🔬 Pillar 2: RL & Architecture (3 papers)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 11 | Surgical-MambaLLM: Mamba2-enhanced Multimodal Large Language Model for VQLA in Robotic Surgery | Surgical-MambaLLM: a Mamba2-enhanced multimodal large language model for visual question localized-answering (VQLA) in robotic surgery. | Mamba large language model multimodal | | |
| 12 | Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment | Learns hyperspectral images with text prompts to achieve efficient multimodal alignment. | distillation scene understanding HSI | | |
| 13 | Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization | Proposes the CaRe-DPO framework, which uses dual-group direct preference optimization to improve caption generation quality for text-video retrieval. | DPO direct preference optimization large language model | ✅ | |
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 14 | Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose | I2S framework: uses 3D hand pose during human-object interactions for unobtrusive user identification in augmented reality. | manipulation bi-manual human-object interaction | | |
🔬 Pillar 4: Generative Motion (1 paper)
| # | Title | One-sentence summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | HyPlaneHead: Rethinking Tri-plane-like Representations in Full-Head Image Synthesis | Proposes HyPlaneHead, which achieves high-quality full-head image synthesis through a hybrid-plane representation. | penetration | | |