cs.CV(2025-09-20)

📊 15 papers in total | 🔗 2 with code

🎯 Navigation by interest area

Pillar 9: Embodied Foundation Models (6 🔗1) · Pillar 3: Perception & Semantics (4) · Pillar 2: RL & Architecture (3 🔗1) · Pillar 1: Robot Control (1) · Pillar 4: Generative Motion (1)

🔬 Pillar 9: Embodied Foundation Models (6 papers)

# | Title | Key point | Tags
1 | KV-Efficient VLA: A Method to Speed up Vision Language Models with RNN-Gated Chunked KV Cache | KV-Efficient VLA: accelerates vision-language models with an RNN-gated chunked KV cache | vision-language-action, VLA
2 | MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation | MMPart: uses multimodal large language models for part-aware 3D generation | large language model
3 | Animalbooth: multimodal feature enhancement for animal subject personalization | AnimalBooth: personalized image generation of animal subjects through multimodal feature enhancement | multimodal
4 | Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model | Detects and simulates urban heat islands with a fine-tuned geospatial foundation model | foundation model
5 | Advancing Reference-free Evaluation of Video Captions with Factual Analysis | Proposes VC-Inspector, a reference-free video caption evaluation framework based on factual analysis | large language model, multimodal
6 | Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence | Proposes the ActiSeg-NL benchmark to study action-prompted video segmentation under label noise, and proposes PMHM to improve robustness | multimodal

🔬 Pillar 3: Perception & Semantics (4 papers)

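# | Title | Key point | Tags
7 | Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding | Text-Scene: proposes a scene-to-language parsing framework for 3D scene understanding | scene understanding, affordance, spatial relationship
8 | ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting | Proposes the ST-GS framework, which improves 3D semantic occupancy prediction for vision-centric autonomous driving through spatial-temporal Gaussian splatting | gaussian splatting, splatting, scene understanding
9 | MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging | MedGS: Gaussian-splatting-based reconstruction and interpolation for multi-modal 3D medical imaging | gaussian splatting, splatting
10 | SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving | SQS: enhances sparse perception models for autonomous driving with query-based splatting | splatting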

🔬 Pillar 2: RL & Architecture (3 papers)

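# | Title | Key point | Tags
11 | Surgical-MambaLLM: Mamba2-enhanced Multimodal Large Language Model for VQLA in Robotic Surgery | Surgical-MambaLLM: a Mamba2-enhanced multimodal large language model for visual question localized-answering (VQLA) in robotic surgery | Mamba, large language model, multimodal
12 | Learning Hyperspectral Images with Curated Text Prompts for Efficient Multimodal Alignment | Learns hyperspectral images with text prompts for efficient multimodal alignment | distillation, scene understanding, HSI
13 | Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization | Proposes the CaRe-DPO framework, which improves caption generation quality for text-video retrieval through dual-group direct preference optimization | DPO, direct preference optimization, large language model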

🔬 Pillar 1: Robot Control (1 paper)

# | Title | Key point | Tags
14 | Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose | I2S framework: user identification from human-object interactions using 3D hand pose | manipulation, bi-manual, human-object interaction

🔬 Pillar 4: Generative Motion (1 paper)

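# | Title | Key point | Tags
15 | HyPlaneHead: Rethinking Tri-plane-like Representations in Full-Head Image Synthesis | Proposes HyPlaneHead, which achieves high-quality full-head image synthesis through a hybrid-plane representation | penetration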
