Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation

Authors: Huy Hoang Nguyen, Johannes Huemer, Markus Murschitz, Tobias Glueck, Minh Nhat Vu, Andreas Kugi

Categories: cs.RO, cs.CV

Published: 2025-08-21

Comments: 8 pages, 7 figures

🔗 Code/Project: PROJECT_PAGE

💡 One-Sentence Summary

Proposes the Lang2Lift framework to automate pallet pick-up by forklifts operating outdoors.

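🎯 Matched Areas: Pillar 1: Robot Control; Pillar 3: Perception & Semantics; Pillar 9: Embodied Foundation Models

Keywords: pallet detection, pose estimation, natural language processing, autonomous forklift, logistics automation, multimodal fusion, computer vision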

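📋 Key Points

Existing pallet-handling methods struggle outdoors with variable payloads and cluttered surroundings, which limits automation efficiency.
Lang2Lift performs pallet detection and pose estimation from natural-language instructions, which simplifies the operator workflow and raises the level of automation.
On the ADAPT platform, Lang2Lift reaches 0.76 mIoU pallet segmentation accuracy on a real-world test dataset, demonstrating its effectiveness in practice.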

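📝 Abstract (Translated Summary)

The logistics and construction industries face persistent challenges in automating pallet handling outdoors, particularly with variable payloads, inconsistent pallet quality and dimensions, and cluttered surroundings. The paper presents Lang2Lift, a framework that uses foundation models for natural language-guided pallet detection and 6D pose estimation, so that operators can specify targets with intuitive commands such as "pick up the steel beam pallet near the crane." The perception pipeline integrates Florence-2 and SAM-2 for language-grounded segmentation and combines them with FoundationPose for robust pose estimation in complex, multi-pallet outdoor scenes. In experiments on the ADAPT autonomous forklift platform, Lang2Lift achieves 0.76 mIoU pallet segmentation accuracy, and the timing and error analysis supports its feasibility in real logistics and construction environments.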

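🔬 Method Details

Problem definition: The paper addresses automated pallet handling in outdoor environments. Existing methods cope poorly with variable payloads and environmental complexity, which leads to inefficiency and safety risks.

Core idea: Lang2Lift uses natural-language instructions to guide pallet detection and pose estimation, letting operators specify the target intuitively and making the automation more flexible and accurate.

Technical framework: The framework comprises a perception pipeline and a motion planning module. The perception pipeline combines Florence-2 and SAM-2 for language-grounded segmentation and uses FoundationPose for pose estimation. The resulting poses are passed to the motion planning module for fully autonomous forklift operation.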
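The data flow described above can be sketched as three chained stages. This is a minimal illustration only, not the authors' implementation: the `Stages` container, the `perceive` function, the type aliases, and the `fake_*` stand-ins are all hypothetical names introduced here, and the stand-ins do not call Florence-2, SAM-2, or FoundationPose. They only mark where a grounding model, a segmentation model, and a pose estimator would plug in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical placeholder types; a real system would use image arrays.
Image = List[List[int]]          # rows of pixel (or depth) values
Mask = List[List[bool]]          # per-pixel pallet mask
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive
Pose = Tuple[float, float, float, float, float, float]  # x, y, z, roll, pitch, yaw


@dataclass
class Stages:
    """The three perception stages, injected as plain callables."""
    ground: Callable[[Image, str], Optional[Box]]         # command -> target box
    segment: Callable[[Image, Box], Mask]                 # box -> pixel mask
    estimate_pose: Callable[[Image, Image, Mask], Pose]   # mask -> 6D pose


def perceive(rgb: Image, depth: Image, command: str, stages: Stages) -> Optional[Pose]:
    """Run grounding, segmentation, and pose estimation in sequence.

    Returns None when the command cannot be grounded or the mask is empty,
    so a downstream planner never receives a pose for a missing target.
    """
    box = stages.ground(rgb, command)
    if box is None:
        return None
    mask = stages.segment(rgb, box)
    if not any(any(row) for row in mask):
        return None
    return stages.estimate_pose(rgb, depth, mask)


# --- Stand-in stages for illustration only (not real models) ---
def fake_ground(rgb: Image, command: str) -> Optional[Box]:
    return (1, 1, 3, 3) if "pallet" in command else None


def fake_segment(rgb: Image, box: Box) -> Mask:
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(len(row))]
            for y, row in enumerate(rgb)]


def fake_pose(rgb: Image, depth: Image, mask: Mask) -> Pose:
    # Use the mean depth of the masked pixels as z; all other components are zero.
    zs = [depth[y][x] for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    return (0.0, 0.0, sum(zs) / len(zs), 0.0, 0.0, 0.0)


STUB_STAGES = Stages(fake_ground, fake_segment, fake_pose)
```

Passing the stages in as callables keeps the control flow independent of any particular model, and the early `None` returns make the failure cases (ungrounded command, empty mask) explicit before anything reaches motion planning.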

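Key innovation: Lang2Lift's main contribution is combining natural language processing with visual perception, which markedly improves pallet detection and pose estimation in complex environments compared with traditional methods that rely on visual input alone.

Key design: The design uses multimodal fusion to remain robust under varying lighting conditions, and the loss functions and network structure are optimized to improve segmentation and pose estimation accuracy. Specific parameter settings and architecture details are described in the paper.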

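📊 Experimental Highlights

On the ADAPT autonomous forklift platform, Lang2Lift achieves 0.76 mIoU pallet segmentation accuracy, showing that it is effective in complex outdoor environments. The results indicate that the system performs well in both timing and accuracy and is feasible for practical deployment.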
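For readers unfamiliar with the metric, mIoU (mean intersection over union) can be computed as in the sketch below. It assumes binary pallet masks and averages the IoU per image; the summary does not say how the paper averages (per image or per class), so treat that choice, along with the function names `iou` and `mean_iou`, as assumptions made for illustration.

```python
from typing import Sequence

Mask = Sequence[Sequence[bool]]  # binary mask, one row per image row


def iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            inter += bool(p and t)
            union += bool(p or t)
    # Two empty masks agree perfectly; define their IoU as 1.0.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], truths: Sequence[Mask]) -> float:
    """Average the per-image IoU over a dataset (assumed averaging scheme)."""
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equal in length")
    return sum(iou(p, t) for p, t in zip(preds, truths)) / len(preds)


# Toy example: the first prediction covers twice the true area (IoU 0.5),
# the second matches exactly (IoU 1.0).
preds = [[[True, True], [False, False]], [[True, False], [False, False]]]
truths = [[[True, False], [False, False]], [[True, False], [False, False]]]
print(mean_iou(preds, truths))  # 0.75
```

An mIoU of 0.76 therefore means that, on average, the predicted and ground-truth pallet regions overlap on about three quarters of their combined area.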

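🎯 Application Scenarios

Lang2Lift has broad application potential, especially for automated handling tasks in the logistics and construction industries. By raising the level of automation in pallet handling, it can help address labor shortages and safety risks and improve overall efficiency. The framework could also be extended to other automated equipment and scenarios, advancing intelligent logistics.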

📄 Abstract (Original)

The logistics and construction industries face persistent challenges in automating pallet handling, especially in outdoor environments with variable payloads, inconsistencies in pallet quality and dimensions, and unstructured surroundings. In this paper, we tackle automation of a critical step in pallet transport: the pallet pick-up operation. Our work is motivated by labor shortages, safety concerns, and inefficiencies in manually locating and retrieving pallets under such conditions. We present Lang2Lift, a framework that leverages foundation models for natural language-guided pallet detection and 6D pose estimation, enabling operators to specify targets through intuitive commands such as "pick up the steel beam pallet near the crane." The perception pipeline integrates Florence-2 and SAM-2 for language-grounded segmentation with FoundationPose for robust pose estimation in cluttered, multi-pallet outdoor scenes under variable lighting. The resulting poses feed into a motion planning module for fully autonomous forklift operation. We validate Lang2Lift on the ADAPT autonomous forklift platform, achieving 0.76 mIoU pallet segmentation accuracy on a real-world test dataset. Timing and error analysis demonstrate the system's robustness and confirm its feasibility for deployment in operational logistics and construction environments. Video demonstrations are available at https://eric-nguyen1402.github.io/lang2lift.github.io/
