cs.CV (2025-07-13)

📊 13 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (5 🔗1) · Pillar 2: RL Algorithms & Architecture (4 🔗1) · Pillar 1: Robot Control (2) · Pillar 3: Spatial Perception & Semantics (1) · Pillar 8: Physics-based Animation (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (5 papers)

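| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 1 | MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models | Proposes MENTOR, an efficient fine-tuning framework for multimodal-conditioned autoregressive vision generation models | multimodal | |
| 2 | Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges | A survey of prompt engineering for SAM: methodologies, applications, and challenges | foundation model, multimodal | |
| 3 | VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization | VDInstruct: zero-shot key information extraction via content-aware vision tokenization | large language model, multimodal | |
| 4 | ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments | Proposes the ExpStar model for automatic commentary generation for multi-discipline scientific experiments | multimodal | |
| 5 | WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending | WordCraft: an interactive artistic typography generation system that supports local editing and iterative style refinement | large language model | |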

🔬 Pillar 2: RL Algorithms & Architecture (4 papers)

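| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 6 | Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model | Prompt2DEM: generates high-resolution DEMs for urban and open environments using a monocular foundation model and global prompts | MAE, depth estimation, monocular depth | |
| 7 | QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models | QuarterMap: an efficient post-training token pruning method for visual state space models | Mamba, SSM, state space model | |
| 8 | HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space | Proposes HMID-Net, which explores masked image modeling and knowledge distillation in hyperbolic space to learn visual-semantic hierarchies more efficiently | distillation, multimodal | |
| 9 | Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation | Proposes Linearized Lookahead Variational Score Distillation (L²-VSD) to improve text-to-3D generation quality | distillation | |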

🔬 Pillar 1: Robot Control (2 papers)

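| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 10 | SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation | SegVec3D: an instance segmentation method based on vector embedding of 3D objects, aimed at robot manipulation | manipulation, multimodal | |
| 11 | Visuo-Acoustic Hand Pose and Contact Estimation | Proposes VibeMesh to address hand pose and contact event estimation | manipulation | |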

🔬 Pillar 3: Spatial Perception & Semantics (1 paper)

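| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 12 | VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding | Proposes the VRU-Accident benchmark for evaluating MLLMs on video question answering and dense captioning in VRU accident scenes | scene understanding, large language model, multimodal | |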

🔬 Pillar 8: Physics-based Animation (1 paper)

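| # | Title | One-sentence summary | Tags | 🔗 |
|---|---|---|---|---|
| 13 | VST-Pose: A Velocity-Integrated Spatiotemporal Attention Network for Human WiFi Pose Estimation | VST-Pose: WiFi-based human pose estimation with a spatiotemporal attention network, applied to smart homes | spatiotemporal | |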
