cs.CV (2025-07-13)
📊 13 papers in total | 🔗 3 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (5 🔗1)
Pillar 2: RL & Architecture (4 🔗1)
Pillar 1: Robot Control (2)
Pillar 3: Perception & Semantics (1)
Pillar 8: Physics-based Animation (1 🔗1)
🔬 Pillar 9: Embodied Foundation Models (5 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models | Proposes MENTOR, an efficient multimodal-conditioned fine-tuning framework for autoregressive vision generation models | multimodal | ✅ | |
| 2 | Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges | A survey of prompt engineering for SAM: methods, applications, and challenges | foundation model, multimodal | | |
| 3 | VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization | VDInstruct: zero-shot key information extraction via content-aware vision tokenization | large language model, multimodal | | |
| 4 | ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments | Proposes ExpStar, a model for automatic commentary generation for multi-discipline scientific experiments | multimodal | | |
| 5 | WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending | WordCraft: an interactive artistic typography generation system that supports local editing and iterative style refinement | large language model | | |
🔬 Pillar 2: RL & Architecture (4 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 6 | Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model | Prompt2DEM: generates high-resolution DEMs for urban and open environments using a monocular foundation model and global prompts | MAE, depth estimation, monocular depth | ✅ | |
| 7 | QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models | QuarterMap: an efficient post-training token pruning method for visual state space models | Mamba, SSM, state space model | | |
| 8 | HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space | Proposes HMID-Net, which explores masked image modeling and knowledge distillation in hyperbolic space to learn visual-semantic hierarchies more efficiently | distillation, multimodal | | |
| 9 | Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation | Proposes Linearized Lookahead Variational Score Distillation (L²-VSD) to improve text-to-3D generation quality | distillation | | |
🔬 Pillar 1: Robot Control (2 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 10 | SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation | SegVec3D: an instance segmentation method based on vector embedding of 3D objects, oriented toward robot manipulation | manipulation, multimodal | | |
| 11 | Visuo-Acoustic Hand Pose and Contact Estimation | Proposes VibeMesh for estimating hand pose and contact events | manipulation | | |
🔬 Pillar 3: Perception & Semantics (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding | Proposes the VRU-Accident benchmark for evaluating MLLMs on video question answering and dense captioning in VRU accident scenes | scene understanding, large language model, multimodal | | |
🔬 Pillar 8: Physics-based Animation (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 13 | VST-Pose: A Velocity-Integrated Spatiotemporal Attention Network for Human WiFi Pose Estimation | VST-Pose: WiFi-based human pose estimation with a spatiotemporal attention network, applied to smart homes | spatiotemporal | ✅ | |