ETP-R1: Evolving Topological Planning with Reinforcement Fine-tuning for Vision-Language Navigation in Continuous Environments
Authors: Shuhao Ye, Sitong Mao, Yuxiang Cui, Xuan Yu, Shichao Zhai, Wen Chen, Shunbo Zhou, Rong Xiong, Yue Wang
Category: cs.RO
Published: 2025-12-24
Comments: 8 pages, 6 figures
🔗 Code/Project: GITHUB
💡 One-sentence takeaway
ETP-R1: evolving topological planning with reinforcement fine-tuning for vision-language navigation in continuous environments
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: vision-language navigation; continuous environments; topological planning; reinforcement fine-tuning; large-scale pretraining; robot navigation; Group Relative Policy Optimization
📋 Key points
- Existing graph-based VLN-CE methods lag behind LVLM-based methods in leveraging large-scale data and advanced training paradigms.
- ETP-R1 improves graph-based VLN-CE models by constructing a large-scale pretraining dataset and introducing reinforcement fine-tuning.
- Experiments show that ETP-R1 sets new state-of-the-art performance on the R2R-CE and RxR-CE benchmarks.
📝 Abstract (translated)
This paper presents ETP-R1, a framework that bridges the gap between graph-based Vision-Language Navigation in Continuous Environments (VLN-CE) methods and those based on Large Vision-Language Models (LVLMs) by applying data scaling and Reinforcement Fine-Tuning (RFT) to a graph-based VLN-CE model. First, a high-quality, large-scale pretraining dataset is constructed using the Gemini API; it contains diverse, low-hallucination instructions for topological trajectories, providing rich supervision for the graph-based policy to map language to topological paths. This foundation is further strengthened by unifying data from the R2R and RxR tasks for joint pretraining. On top of this, a three-stage training paradigm is introduced, culminating in the first application of closed-loop, online RFT to a graph-based VLN-CE model, powered by the Group Relative Policy Optimization (GRPO) algorithm. Extensive experiments show the approach is highly effective, establishing new state-of-the-art performance on both the R2R-CE and RxR-CE benchmarks.
🔬 Method details
Problem definition: VLN-CE requires an embodied agent to follow natural-language instructions and navigate to a target location in a continuous environment. Existing graph-based methods are efficient, but they struggle to fully exploit large-scale data and advanced training methods, which limits their performance.
Core idea: ETP-R1 brings the advantages of data scaling and reinforcement fine-tuning (RFT) to graph-based VLN-CE models: large-scale pretraining supplies rich supervision signals, and RFT further optimizes the policy to improve navigation performance.
Technical framework: ETP-R1 adopts a three-stage training paradigm. Stage one generates a large-scale pretraining dataset with the Gemini API and performs joint pretraining. Stage two performs supervised fine-tuning. Stage three applies closed-loop, online reinforcement fine-tuning, optimizing the policy with the Group Relative Policy Optimization (GRPO) algorithm. The overall framework comprises four main modules: data generation, pretraining, supervised fine-tuning, and reinforcement fine-tuning.
Key innovation: ETP-R1 is the first to apply closed-loop, online reinforcement fine-tuning to a graph-based VLN-CE model. Using the Gemini API to generate a large-scale, high-quality pretraining dataset is another important contribution.
Key design: The pretraining dataset contains diverse instructions for topological trajectories, designed to reduce hallucination. The reinforcement fine-tuning stage uses the Group Relative Policy Optimization (GRPO) algorithm to improve policy stability. Specific hyperparameters and network-architecture details are not stated in the paper (unknown).
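The graph-based abstraction described above reduces the action space to picking one candidate waypoint on a topological map. A minimal sketch of that waypoint-selection step is below; the node names and policy scores are purely illustrative assumptions, not from the paper:

```python
from typing import Dict, List

def select_waypoint(candidates: List[str], scores: Dict[str, float]) -> str:
    """Pick the candidate node with the highest policy score.

    In graph-based VLN-CE, the continuous environment is abstracted into
    a topological map, so the agent acts by choosing one of the current
    node's candidate waypoints instead of emitting low-level motions.
    """
    return max(candidates, key=lambda n: scores[n])

# Toy topological neighborhood with hypothetical policy scores.
neighbors = ["node_a", "node_b", "node_c"]
policy_scores = {"node_a": 0.2, "node_b": 0.7, "node_c": 0.1}
print(select_waypoint(neighbors, policy_scores))  # node_b
```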
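The GRPO step mentioned above replaces a learned value critic with group-relative normalization: for each instruction, several rollouts are sampled and each rollout's advantage is its reward standardized within that group. A minimal sketch of this advantage computation, with hypothetical rollout rewards (the paper's actual reward design is not given here):

```python
import numpy as np

def grpo_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages as used in GRPO: normalize each sampled
    rollout's reward by the mean and std of its group, so no separate
    value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four hypothetical navigation rollouts sampled for one instruction,
# scored by some scalar reward (illustrative values only).
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
print(adv)  # zero-mean within the group; best rollout gets positive advantage
```

Rollouts that beat their group mean get positive advantages and are reinforced; below-mean rollouts are suppressed, which is what stabilizes training without a critic.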
📊 Experimental highlights
ETP-R1 achieves new state-of-the-art performance on the R2R-CE and RxR-CE benchmarks. Exact scores and improvement margins are not given in this summary (unknown), but the abstract states that the method outperforms existing approaches on all major metrics.
🎯 Application scenarios
ETP-R1's results can be applied to robot navigation, intelligent assistants, virtual reality, and related fields. The method improves a robot's ability to navigate complex environments, helping it better understand human instructions and complete navigation tasks. Looking ahead, this line of research could advance robotics applications in home service, logistics, and similar domains.
📄 Abstract (original)
Vision-Language Navigation in Continuous Environments (VLN-CE) requires an embodied agent to navigate towards target in continuous environments, following natural language instructions. While current graph-based methods offer an efficient, structured approach by abstracting the environment into a topological map and simplifying the action space to waypoint selection, they lag behind methods based on Large Vision-Language Models (LVLMs) in leveraging large-scale data and advanced training paradigms. In this paper, we try to bridge this gap by introducing ETP-R1, a framework that applies the paradigm of scaling up data and Reinforcement Fine-Tuning (RFT) to a graph-based VLN-CE model. To build a strong foundation, we first construct a high-quality, large-scale pretraining dataset using the Gemini API. This dataset consists of diverse, low-hallucination instructions for topological trajectories, providing rich supervision for our graph-based policy to map language to topological paths. This foundation is further strengthened by unifying data from both R2R and RxR tasks for joint pretraining. Building on this, we introduce a three-stage training paradigm, which culminates in the first application of closed-loop, online RFT to a graph-based VLN-CE model, powered by the Group Relative Policy Optimization (GRPO) algorithm. Extensive experiments demonstrate that our approach is highly effective, establishing new state-of-the-art performance across all major metrics on both the R2R-CE and RxR-CE benchmarks. Our code is available at https://github.com/Cepillar/ETP-R1.