cs.CV (2025-07-13)

📊 13 papers in total | 🔗 3 with code

🎯 Interest Area Navigation

Pillar 9: Embodied Foundation Models (5 🔗1) · Pillar 2: RL & Architecture (4 🔗1) · Pillar 1: Robot Control (2) · Pillar 3: Perception & Semantics (1) · Pillar 8: Physics-based Animation (1 🔗1)

🔬 Pillar 9: Embodied Foundation Models (5 papers)

#	Title	One-sentence summary	Tags	🔗	⭐
1	MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models	Proposes MENTOR, an efficient multimodal-conditioned tuning framework for autoregressive vision generation models	multimodal	✅
2	Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges	A survey of prompt engineering for SAM: methodologies, applications, and challenges	foundation model, multimodal
3	VDInstruct: Zero-Shot Key Information Extraction via Content-Aware Vision Tokenization	VDInstruct: zero-shot key information extraction via content-aware vision tokenization	large language model, multimodal
4	ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments	Proposes ExpStar, a model for automatic commentary generation for multi-discipline scientific experiments	multimodal
5	WordCraft: Interactive Artistic Typography with Attention Awareness and Noise Blending	WordCraft: an interactive artistic typography system that supports local editing and iterative style refinement	large language model

🔬 Pillar 2: RL & Architecture (4 papers)

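#	Title	One-sentence summary	Tags	🔗	⭐
6	Prompt2DEM: High-Resolution DEMs for Urban and Open Environments from Global Prompts Using a Monocular Foundation Model	Prompt2DEM: generates high-resolution DEMs for urban and open environments using a monocular foundation model and global prompts	MAE, depth estimation, monocular depth	✅
7	QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models	QuarterMap: an efficient post-training token pruning method for visual state space models	Mamba, SSM, state space model
8	HMID-Net: An Exploration of Masked Image Modeling and Knowledge Distillation in Hyperbolic Space	Proposes HMID-Net, which explores masked image modeling and knowledge distillation in hyperbolic space to learn visual-semantic hierarchies more efficiently	distillation, multimodal
9	Advancing Text-to-3D Generation with Linearized Lookahead Variational Score Distillation	Proposes Linearized Lookahead Variational Score Distillation (L²-VSD) to improve text-to-3D generation quality	distillation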

🔬 Pillar 1: Robot Control (2 papers)

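#	Title	One-sentence summary	Tags	🔗	⭐
10	SegVec3D: A Method for Vector Embedding of 3D Objects Oriented Towards Robot manipulation	SegVec3D: an instance segmentation method based on vector embeddings of 3D objects, aimed at robot manipulation	manipulation, multimodal
11	Visuo-Acoustic Hand Pose and Contact Estimation	Proposes VibeMesh for estimating hand pose and contact events	manipulation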

🔬 Pillar 3: Perception & Semantics (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
12	VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding	Proposes the VRU-Accident benchmark for evaluating MLLMs on video question answering and dense captioning in VRU accident scenes	scene understanding, large language model, multimodal

🔬 Pillar 8: Physics-based Animation (1 paper)

#	Title	One-sentence summary	Tags	🔗	⭐
13	VST-Pose: A Velocity-Integrated Spatiotemporal Attention Network for Human WiFi Pose Estimation	VST-Pose: WiFi-based human pose estimation with a spatiotemporal attention network, targeting smart-home applications	spatiotemporal	✅
