cs.CV (2025-12-23)
📊 17 papers in total | 🔗 4 with code
🎯 Interest Area Navigation
Pillar 9: Embodied Foundation Models (7 🔗2)
Pillar 3: Perception & Semantics (4 🔗1)
Pillar 2: RL & Architecture (3)
Pillar 7: Motion Retargeting (1)
Pillar 1: Robot Control (1 🔗1)
Pillar 6: Video Extraction (1)
🔬 Pillar 9: Embodied Foundation Models (7 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 1 | NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts | NullBUS: multimodal mixed-supervision breast ultrasound segmentation via nullable global-local prompts | multimodal | | |
| 2 | FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models | FlashVLM: text-guided visual token selection that improves the efficiency of large multimodal models | multimodal | | |
| 3 | VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs | VideoScaffold: elastic-scale visual hierarchies for streaming video understanding in MLLMs | large language model, multimodal | ✅ | |
| 4 | Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference | Proposes an input-adaptive visual preprocessing method that improves FastVLM inference efficiency on visual question answering tasks | multimodal | ✅ | |
| 5 | SpatialTree: How Spatial Abilities Branch Out in MLLMs | Builds SpatialTree to systematically evaluate and improve the spatial cognition abilities of MLLMs | multimodal | | |
| 6 | Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models | Proposes DSR Suite and a geometry selection module (GSM) to improve the dynamic spatial reasoning ability of VLMs | foundation model | | |
| 7 | Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark | Proposes the NL-DIR benchmark dataset for document image retrieval from natural language descriptions | large language model | | |
🔬 Pillar 3: Perception & Semantics (4 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 8 | Enhancing annotations for 5D apple pose estimation through 3D Gaussian Splatting (3DGS) | Uses 3D Gaussian Splatting to improve annotation efficiency for 5D apple pose estimation | 3D gaussian splatting, 3DGS, gaussian splatting | | |
| 9 | SirenPose: Dynamic Scene Reconstruction via Geometric Supervision | SirenPose: accurate, temporally consistent dynamic scene reconstruction via geometric supervision | scene reconstruction, physically plausible, spatiotemporal | | |
| 10 | AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment | AlignPose: generalizable 6D pose estimation based on multi-view feature-metric alignment | 6D pose estimation | | |
| 11 | SmartSplat: Feature-Smart Gaussians for Scalable Compression of Ultra-High-Resolution Images | SmartSplat: a feature-aware Gaussian Splatting image compression framework for efficient compression and high-quality reconstruction of ultra-high-resolution images | 3D gaussian splatting, gaussian splatting, splatting | ✅ | |
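🔬 Pillar 2: RL & Architecture (3 papers)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 12 | Bridging Modalities and Transferring Knowledge: Enhanced Multimodal Understanding and Recognition | Proposes methods for multimodal alignment, translation, fusion, and transfer to improve understanding and recognition of complex inputs | distillation, egocentric, multimodal | | |
| 13 | Active Intelligence in Video Avatars via Closed-loop World Modeling | Proposes the ORCA framework, which achieves active intelligence in video avatars through closed-loop world modeling | world model | | |
| 14 | Multi-objective hybrid knowledge distillation for efficient deep learning in smart agriculture | Proposes a multi-objective hybrid knowledge distillation framework for efficient deep learning in smart agriculture | distillation | | |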
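🔬 Pillar 7: Motion Retargeting (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 15 | Beyond Motion Pattern: An Empirical Study of Physical Forces for Human Motion Understanding | Motion understanding that incorporates physical force information: improves gait recognition, action recognition, and video captioning performance | human motion | | |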
🔬 Pillar 1: Robot Control (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 16 | LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving | LEAD: minimizes learner-expert asymmetry in end-to-end driving, improving driving performance in the CARLA simulator | sim-to-real, imitation learning | ✅ | |
🔬 Pillar 6: Video Extraction (1 paper)
| # | Title | One-Sentence Summary | Tags | 🔗 | ⭐ |
|---|---|---|---|---|---|
| 17 | DETACH : Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning | Proposes the DETACH framework, which uses decomposed spatio-temporal alignment to fuse exocentric video with ambient sensors | egocentric | | |