PlanAgent: A Multi-modal Large Language Agent for Closed-loop Vehicle Motion Planning

Authors: Yupeng Zheng, Zebin Xing, Qichao Zhang, Bu Jin, Pengfei Li, Yuhang Zheng, Zhongpu Xia, Kun Zhan, Xianpeng Lang, Yaran Chen, Dongbin Zhao

Category: cs.RO

Published: 2024-06-03 (updated: 2024-06-04)

Comments: This work has been submitted to the IEEE for possible publication

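💡 One-sentence takeaway

Proposes PlanAgent, which uses a multi-modal large language model to tackle closed-loop motion planning for autonomous driving.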

🎯 Matched areas: Pillar 1: Robot Control; Pillar 3: Perception & Semantics; Pillar 9: Embodied Foundation Models

Keywords: autonomous driving, motion planning, multi-modal large language model, closed-loop control, common-sense reasoning

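📋 Key points

Existing rule-based vehicle motion planning methods perform well in common scenarios but struggle to generalize to long-tailed ones.
PlanAgent uses a multi-modal large language model as a cognitive agent, bringing human-like knowledge and common-sense reasoning into closed-loop planning.
On the nuPlan benchmarks, PlanAgent is reported to outperform existing state-of-the-art closed-loop motion planning methods.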

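📝 Abstract (summary)

This paper proposes PlanAgent, a mid-to-mid vehicle motion planning system based on a Multi-modal Large Language Model (MLLM). PlanAgent uses the MLLM as a cognitive agent to bring human-like knowledge, interpretability, and common-sense reasoning into closed-loop planning. It does so through three core modules: an Environment Transformation module builds a Bird's Eye View (BEV) map and a lane-graph-based textual description as inputs; a Reasoning Engine module applies a hierarchical chain-of-thought that runs from scene understanding to lateral and longitudinal motion instructions and ends in planner code generation; a Reflection module simulates and evaluates the generated planner to reduce the MLLM's uncertainty. With the MLLM's common-sense reasoning and generalization ability, PlanAgent is meant to handle both common and complex long-tailed scenarios. On the large-scale and challenging nuPlan benchmarks, PlanAgent outperforms existing state-of-the-art closed-loop motion planning methods.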

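🔬 Method details

Problem definition: Vehicle motion planning is an essential component of autonomous driving. Existing rule-based methods struggle to generalize to complex long-tailed scenarios, while learning-based methods have yet to outperform rule-based ones in large-scale closed-loop settings. This motivates a planning method that draws on common-sense reasoning and generalization to handle a wide range of complex scenarios.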

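Core idea: PlanAgent uses a Multi-modal Large Language Model (MLLM) as a cognitive agent that imitates the reasoning process of a human driver. Environment information is converted into multi-modal inputs the MLLM can interpret, and chain-of-thought reasoning then produces planner code, with the aim of more capable and more generalizable motion planning.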
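The staged reasoning described here can be illustrated with a minimal sketch. It assumes a hypothetical `llm` callable that maps a prompt string to a response string; the stage names and instruction wording are illustrative only and are not taken from the paper. Each stage sees the outputs of the earlier stages, which is what makes the chain hierarchical.

```python
from typing import Callable, Dict

# Hypothetical stage names and instructions, ordered from coarse to fine.
STAGES = (
    ("scene_understanding", "Describe the key agents and road structure in the scene."),
    ("motion_instruction", "Give one lateral and one longitudinal motion instruction."),
    ("planner_code", "Write planner code that carries out the instructions."),
)


def hierarchical_cot(scene_text: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Run the stages in order, feeding each answer into the next prompt."""
    outputs: Dict[str, str] = {}
    context = f"Scene description:\n{scene_text}"
    for name, instruction in STAGES:
        prompt = f"{context}\n\nTask ({name}): {instruction}"
        answer = llm(prompt)  # `llm` is an assumed interface, not a real API
        outputs[name] = answer
        # Append the answer so later stages can condition on it.
        context = f"{context}\n\n[{name}]\n{answer}"
    return outputs
```

In this sketch the final `planner_code` entry would be the text handed to whatever component simulates and evaluates the planner.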

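Technical framework: PlanAgent consists of three main modules: 1) Environment Transformation module: converts environment information into a Bird's Eye View (BEV) map and a lane-graph-based textual description, which serve as the MLLM's inputs. 2) Reasoning Engine module: applies a hierarchical chain-of-thought that starts with scene understanding, then produces lateral and longitudinal motion instructions, and finally generates executable planner code. 3) Reflection module: simulates and evaluates the generated planner to reduce the MLLM's uncertainty.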
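How the three modules might fit together can be sketched as a simple generate, simulate, and retry loop. This is an illustration under stated assumptions, not the authors' implementation: `SceneInput`, `PlannerCandidate`, `plan_with_reflection`, the score threshold, and the round limit are all hypothetical, and the `reason` and `simulate` callables stand in for the Reasoning Engine and the Reflection module's simulator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SceneInput:
    bev_image: List[List[int]]  # placeholder for a BEV raster
    lane_text: str              # lane-graph-based textual description


@dataclass
class PlannerCandidate:
    code: str           # planner code text produced by the reasoning step
    score: float = 0.0  # filled in after simulation


def plan_with_reflection(
    scene: SceneInput,
    reason: Callable[[SceneInput, str], PlannerCandidate],
    simulate: Callable[[PlannerCandidate], float],
    threshold: float = 0.8,  # assumed acceptance score
    max_rounds: int = 3,     # assumed retry budget
) -> Tuple[PlannerCandidate, int]:
    """Return the best-scoring candidate and the number of rounds used."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""
    best: Optional[PlannerCandidate] = None
    for round_idx in range(1, max_rounds + 1):
        candidate = reason(scene, feedback)
        candidate.score = simulate(candidate)
        if best is None or candidate.score > best.score:
            best = candidate
        if candidate.score >= threshold:
            return best, round_idx
        # Pass the evaluation result back to the reasoning step.
        feedback = f"previous planner scored {candidate.score:.2f}; revise"
    return best, max_rounds
```

Keeping the best candidate across rounds means a later, worse retry cannot replace an earlier acceptable plan.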

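Key innovation: PlanAgent brings an MLLM into vehicle motion planning and uses it as a cognitive agent. The Environment Transformation, Reasoning Engine, and Reflection modules work together so that the planner can draw on the MLLM's common-sense reasoning and generalization when handling complex scenarios. The authors describe it as the first mid-to-mid planning system based on an MLLM.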

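Key design: The Environment Transformation module has to encode environment information into multi-modal inputs that the MLLM can interpret. The Reasoning Engine module depends on a chain-of-thought design that imitates a human driver's reasoning and yields high-quality planner code. The Reflection module has to evaluate the generated planner's performance reliably. The abstract does not give parameter settings or other implementation details; the authors state that code will be released soon.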
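As a concrete illustration of encoding a lane graph as text, the sketch below renders the ego lane, its successors, and its neighbors as short sentences. The `Lane` fields, the `describe_lane_graph` helper, and the sentence templates are assumptions made for this example; the source does not specify the actual description format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Lane:
    lane_id: str
    speed_limit_mps: float
    successors: List[str] = field(default_factory=list)
    left_neighbor: Optional[str] = None
    right_neighbor: Optional[str] = None


def describe_lane_graph(lanes: Dict[str, Lane], ego_lane_id: str) -> str:
    """Render the lane graph around the ego lane as plain sentences."""
    ego = lanes[ego_lane_id]
    lines = [
        f"Ego vehicle is on lane {ego.lane_id} "
        f"(speed limit {ego.speed_limit_mps:.1f} m/s)."
    ]
    if ego.successors:
        lines.append(f"Lane {ego.lane_id} continues to: {', '.join(ego.successors)}.")
    else:
        lines.append(f"Lane {ego.lane_id} has no successor.")
    for side, neighbor in (("left", ego.left_neighbor), ("right", ego.right_neighbor)):
        if neighbor is not None:
            lines.append(f"Adjacent {side} lane: {neighbor}.")
    return "\n".join(lines)


lanes = {
    "L1": Lane("L1", 13.9, successors=["L3"], left_neighbor="L2"),
    "L2": Lane("L2", 13.9, right_neighbor="L1"),
    "L3": Lane("L3", 8.3),
}
print(describe_lane_graph(lanes, "L1"))
```

A text form like this can be paired with the BEV image so the model receives both a visual and a symbolic view of the same scene.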

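📊 Experimental highlights

According to the abstract, PlanAgent outperforms the existing state of the art on the closed-loop motion planning task on the nuPlan benchmarks. The abstract reports no specific numbers or margins. The authors attribute the result to the MLLM's common-sense reasoning and generalization, which help PlanAgent handle both common and complex long-tailed scenarios.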

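🎯 Application scenarios

PlanAgent targets closed-loop motion planning for autonomous driving, with an emphasis on the long-tailed scenarios that rule-based planners handle poorly. Its practical value lies in making planning more robust under complex and unfamiliar traffic conditions. The evaluation described in the abstract is limited to the nuPlan benchmarks, so applicability to other settings, such as highways or off-road environments, is not established there.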

📄 Abstract (original)

Vehicle motion planning is an essential component of autonomous driving technology. Current rule-based vehicle motion planning methods perform satisfactorily in common scenarios but struggle to generalize to long-tailed situations. Meanwhile, learning-based methods have yet to achieve superior performance over rule-based approaches in large-scale closed-loop scenarios. To address these issues, we propose PlanAgent, the first mid-to-mid planning system based on a Multi-modal Large Language Model (MLLM). MLLM is used as a cognitive agent to introduce human-like knowledge, interpretability, and common-sense reasoning into the closed-loop planning. Specifically, PlanAgent leverages the power of MLLM through three core modules. First, an Environment Transformation module constructs a Bird's Eye View (BEV) map and a lane-graph-based textual description from the environment as inputs. Second, a Reasoning Engine module introduces a hierarchical chain-of-thought from scene understanding to lateral and longitudinal motion instructions, culminating in planner code generation. Last, a Reflection module is integrated to simulate and evaluate the generated planner for reducing MLLM's uncertainty. PlanAgent is endowed with the common-sense reasoning and generalization capability of MLLM, which empowers it to effectively tackle both common and complex long-tailed scenarios. Our proposed PlanAgent is evaluated on the large-scale and challenging nuPlan benchmarks. A comprehensive set of experiments convincingly demonstrates that PlanAgent outperforms the existing state-of-the-art in the closed-loop motion planning task. Codes will be soon released.
