KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation
Authors: Yuansen Huang, Jiayi Chen, Haoran Liu, Yubin Ke, Bing Han, Jiangran Lyu, Mi Yan, Li Yi, He Wang
Categories: cs.RO
Published: 2026-06-08
Comments: 14 pages, 7 figures, 6 tables
💡 One-Sentence Takeaway
Proposes KPGrasp, a keypoint flow-matching framework for generating high-quality dexterous grasps.
🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 4: Generative Motion
Keywords: dexterous grasping, flow matching, Transformer models, robotics, spatial reasoning
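📋 Key Points
- Existing learning-based methods struggle to generate high-quality dexterous grasps and often depend on carefully tuned contact losses or costly contact-based test-time refinement.
- KPGrasp learns dexterous grasp priors within a flow-matching framework, pairing an all-Euclidean 3D hand-keypoint parameterization with a scalable Transformer flow model.
- On the Dexonomy benchmark, KPGrasp reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4%, and it achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning.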
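📝 Summary
Generating high-quality dexterous grasps remains challenging for learning-based methods, which often depend on carefully tuned contact losses or costly contact-based test-time refinement. This paper presents KPGrasp, a flow-matching framework that learns dexterous grasp priors from large-scale data without relying on contact losses or contact-based test-time refinement. KPGrasp combines an all-Euclidean 3D hand-keypoint parameterization with a simple, scalable Transformer flow model. The parameterization avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space and expresses grasps in the same coordinate frame as the object point cloud, which enables native spatial reasoning. The Transformer flow model is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size. Experiments show state-of-the-art performance on two simulation benchmarks.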
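🔬 Method Details
Problem definition: The paper targets high-quality dexterous grasp generation. Existing methods often depend on carefully tuned contact losses or costly contact-based test-time refinement, which adds tuning effort and inference cost and can limit performance.
Core idea: KPGrasp learns a dexterous grasp prior with flow matching and represents each grasp as a set of 3D hand keypoints in Euclidean space. Because the keypoints live in the same coordinate frame as the object point cloud, the model can reason about hand-object geometry directly, avoiding the drawbacks of the conventional mixed SE(3) pose and joint-angle output space.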
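The shared-frame property described above can be illustrated with a minimal sketch. The toy point sets, the helper names `rigid_transform` and `min_distance`, and the yaw-only rotation are all assumptions made for illustration and are not taken from the paper. The sketch only shows that when hand keypoints and object points are both plain 3D coordinates in one frame, a hand-object relation such as the closest distance is an ordinary Euclidean quantity that is unchanged when both sets are moved together.

```python
import math

def rigid_transform(points, yaw, translation):
    """Rotate 3D points about the z-axis by `yaw` radians, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]

def min_distance(keypoints, cloud):
    """Smallest Euclidean distance between any hand keypoint and any object point."""
    return min(math.dist(k, p) for k in keypoints for p in cloud)

# Hypothetical toy data (metres): 4 object points and 3 hand keypoints,
# all expressed in the same coordinate frame.
object_cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
hand_keypoints = [(0.12, 0.0, 0.0), (0.0, 0.13, 0.0), (0.05, 0.05, 0.15)]

before = min_distance(hand_keypoints, object_cloud)

# Apply the same rigid motion to the object and to the hand keypoints.
motion = (0.7, (0.3, -0.2, 0.5))
moved_cloud = rigid_transform(object_cloud, *motion)
moved_keypoints = rigid_transform(hand_keypoints, *motion)
after = min_distance(moved_keypoints, moved_cloud)
```

With a mixed SE(3) pose and joint-angle output, the same distance would first require forward kinematics of the hand; with keypoints it is read directly from the coordinates.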
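Technical framework: KPGrasp consists of two main components: an all-Euclidean 3D hand-keypoint parameterization, which serves as the grasp representation, and a Transformer flow model, which is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size.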
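As a rough illustration of what a standard flow-matching loss looks like, here is a minimal sketch. It assumes the common linear interpolation path between a noise sample `x0` and a data sample `x1`; the summary does not state which path or noise distribution KPGrasp uses, so that choice, the flattened two-keypoint target, and the function names are all hypothetical. The real model would also condition the velocity prediction on the object point cloud, which this toy omits.

```python
import random

def flow_matching_pair(x0, x1, t):
    """Return (x_t, target velocity) for the linear path x_t = (1 - t) * x0 + t * x1."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between the predicted and target velocity at (x_t, t)."""
    x_t, target = flow_matching_pair(x0, x1, t)
    pred = velocity_fn(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)

rng = random.Random(0)
# Hypothetical data sample: two hand keypoints flattened to 6 coordinates.
x1 = [0.12, 0.0, 0.0, 0.0, 0.13, 0.0]
# Noise sample from a standard Gaussian (an assumed choice of source distribution).
x0 = [rng.gauss(0.0, 1.0) for _ in x1]
t = rng.random()

def oracle_velocity(x_t, t):
    """Stand-in for a perfectly trained network on this single pair."""
    return [b - a for a, b in zip(x0, x1)]

def zero_velocity(x_t, t):
    """Stand-in for an untrained network that predicts no motion."""
    return [0.0] * len(x_t)

loss_oracle = flow_matching_loss(oracle_velocity, x0, x1, t)
loss_zero = flow_matching_loss(zero_velocity, x0, x1, t)
```

In training, `velocity_fn` would be the Transformer and the loss would be averaged over a batch of sampled `(x0, x1, t)` triples; no contact term appears anywhere in the objective.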
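Key innovation: The main contribution is the pairing of a flow-matching generative model with an all-Euclidean keypoint parameterization, which removes the need for contact losses and contact-based test-time refinement.
Key design: KPGrasp is trained with only the standard flow-matching loss, so there are no contact-loss terms to tune. In batched inference it requires only 0.032 s per grasp.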
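Sampling from a flow model amounts to integrating the learned velocity field from noise at t = 0 to a sample at t = 1. The sketch below uses a fixed-step Euler integrator, which is one common choice; the summary does not say which solver or step count KPGrasp uses, so both are assumptions. The velocity field here is a closed-form toy whose target distribution is a single point, standing in for the trained Transformer.

```python
import random

def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k / steps  # largest value is (steps - 1) / steps, so t < 1
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical target: two hand keypoints flattened to 6 coordinates.
target = [0.12, 0.0, 0.0, 0.0, 0.13, 0.0]

def point_mass_velocity(x, t):
    """Exact velocity for the linear path when the data distribution is one point."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(point_mass_velocity, noise, steps=10)
max_error = max(abs(a - b) for a, b in zip(sample, target))
```

Each Euler step costs one forward pass of the velocity model, so per-grasp latency depends on both the step count and how many grasps are sampled together in a batch.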
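📊 Experimental Highlights
On the Dexonomy benchmark, KPGrasp reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4% while reducing penetration depth to 2.4 mm. The same model, without fine-tuning, also achieves the best average performance on the DexGrasp Anything benchmark, which suggests that the learned prior transfers across benchmarks.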
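🎯 Application Scenarios
KPGrasp is relevant to robotic grasping, automated assembly, and human-robot interaction. Fast grasp generation could improve a robot's ability to manipulate objects in complex environments, with potential uses in intelligent manufacturing and service robotics. The paper's real-world experiments on 20 diverse objects indicate that the pipeline can be deployed on a physical setup.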
📄 Abstract (Original)
Generating high-quality dexterous grasps remains challenging for learning-based methods, which often depend on carefully tuned contact losses or costly contact-based test-time refinement. We present KPGrasp, a flow-matching framework that learns dexterous grasp priors from large-scale data rather than relying on contact losses or contact-based test-time refinement. KPGrasp couples an all-Euclidean 3D hand-keypoint parameterization with a simple yet scalable Transformer flow model. The parameterization avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space, expresses grasps in the same frame as the object point cloud, and thus enables native spatial reasoning; the Transformer flow model is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size. Experiments demonstrate state-of-the-art performance on two simulation benchmarks. On the Dexonomy benchmark, it reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4% while reducing penetration depth to 2.4 mm. The same model also achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning. For batched inference, KPGrasp requires only 0.032 s per grasp. Finally, real-world experiments on 20 diverse objects demonstrate that the pipeline can be deployed in a real-world setup.