Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

📄 arXiv: 2606.09236v1

Authors: Luca Ghisi, Jacopo Essenziale, Carlo D'Eramo, Matteo Luperto

Categories: cs.RO, cs.AI

Published: 2026-06-08

Comments: Presented at the "1st Workshop on Generalization in Autonomous Driving: Paradigms, Practice, and Public Road Demonstrations" at ICRA 2026, Vienna. Oral and poster presentation.


💡 One-sentence summary

Proposes self-paced curriculum reinforcement learning for autonomous motorbike racing.

🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: autonomous racing, deep reinforcement learning, motorbike control, dynamic task generation, virtual reality simulation

📋 Key points

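  1. Motorbike racing poses a harder dynamic-control problem than four-wheeled racing, and existing methods struggle to manage balance and lean angle.
  2. The framework combines SAC with SPDL, which generates tasks dynamically to match the agent's learning progress and removes the need for a hand-designed curriculum.
  3. In preliminary experiments, SPDL outperforms SAC alone across multiple tracks and motorbike models, improving training efficiency, lap time, and driving stability.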

📝 Abstract (summary)

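Autonomous racing has advanced considerably through deep reinforcement learning (RL), but mostly for four-wheeled vehicles. Motorbikes are harder to control because the rider must also manage balance and lean angle. This paper presents a framework for training an autonomous racing agent in the VRider SBK simulator. It combines Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks. Preliminary experiments show that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability, establishing a first baseline for RL-based autonomous motorbike racing.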

🔬 Method details

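Problem definition: The paper addresses dynamic control in autonomous motorbike racing. Existing methods handle the balance and lean-angle management specific to motorbikes poorly, which makes training inefficient.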

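Core idea: Combine Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL) so that the tasks presented to the agent are generated dynamically and matched to its current ability, improving learning efficiency and stability.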
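For orientation, the two ingredients can be written in the forms commonly used in the SAC and self-paced RL literature. This is a reference sketch, not necessarily the exact variant used in the paper: the symbols $\alpha$ (entropy temperature), $\lambda$ (curriculum trade-off), $\epsilon$ (step-size bound), $p_\nu(c)$ (current task distribution over contexts $c$), and $\mu(c)$ (target task distribution) are notational assumptions introduced here.

```latex
% SAC: maximum-entropy RL objective
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

% Self-paced curriculum: trade off expected return under the current
% task distribution against its distance to the target distribution
\max_{\nu} \; \mathbb{E}_{c \sim p_\nu(c)} \left[ J(\pi, c) \right]
  - \lambda \, D_{\mathrm{KL}}\big( p_\nu(c) \,\|\, \mu(c) \big)
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\big( p_\nu(c) \,\|\, p_{\nu_{\text{old}}}(c) \big) \le \epsilon
```

In this form, the curriculum favors tasks the agent can already solve while the KL term pulls the task distribution toward the target tasks.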

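Technical framework: The architecture consists of a state-space construction, a reward design, and a task-generation module. The state space combines proprioceptive features, extended with lean-angle history, with global track features given as course points. The reward encourages progress along the track and penalizes behaviors that induce instability.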
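A minimal sketch of how such an observation could be assembled, assuming a fixed-length lean-angle buffer. The class name `LeanHistoryObservation`, the history length, and the feature layout are hypothetical and are not taken from the paper.

```python
from collections import deque

import numpy as np


class LeanHistoryObservation:
    """Builds a flat observation: proprioception + lean-angle history + course points.

    All names and sizes here are illustrative assumptions.
    """

    def __init__(self, history_len: int = 4):
        self.history_len = history_len
        self.lean_history = deque(maxlen=history_len)

    def reset(self, lean_angle: float) -> None:
        # Pre-fill the buffer so the observation size is constant from step 0.
        self.lean_history.clear()
        self.lean_history.extend([lean_angle] * self.history_len)

    def build(self, proprio: np.ndarray, lean_angle: float,
              course_points: np.ndarray) -> np.ndarray:
        # The newest lean angle pushes out the oldest one.
        self.lean_history.append(lean_angle)
        return np.concatenate([
            proprio.ravel(),
            np.asarray(self.lean_history, dtype=np.float32),
            course_points.ravel(),
        ]).astype(np.float32)
```

Keeping a short history of lean angles gives a feed-forward policy some information about how fast the bike is tipping without requiring a recurrent network.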

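Key innovation: The core contribution is the use of SPDL's dynamic task generation. Unlike a static, hand-designed curriculum, it tracks the agent's learning progress and adjusts task difficulty accordingly, which improves training.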
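The idea of a self-paced task distribution can be illustrated with a deliberately simplified toy. SPDL itself solves a KL-constrained optimization over the task distribution; the sketch below swaps that for a plain heuristic that moves a 1-D Gaussian toward the target only when recent returns clear a threshold. The class name, the scalar "difficulty" context, and all parameters are assumptions made for illustration.

```python
import numpy as np


class SelfPacedGaussianCurriculum:
    """Toy self-paced curriculum over a scalar task parameter.

    NOT the SPDL update rule: a threshold-gated interpolation used
    only to illustrate performance-driven task generation.
    """

    def __init__(self, init_mean, init_std, target_mean, target_std,
                 perf_threshold, step=0.1, seed=0):
        self.mean, self.std = float(init_mean), float(init_std)
        self.target_mean, self.target_std = float(target_mean), float(target_std)
        self.perf_threshold = perf_threshold
        self.step = step  # fraction of the remaining gap closed per update
        self.rng = np.random.default_rng(seed)

    def sample_task(self) -> float:
        # Draw a task parameter from the current curriculum distribution.
        return float(self.rng.normal(self.mean, self.std))

    def update(self, episode_returns) -> bool:
        # Hold the distribution still while the agent is below the threshold.
        if np.mean(episode_returns) < self.perf_threshold:
            return False
        # Otherwise move mean and std a bounded step toward the target.
        self.mean += self.step * (self.target_mean - self.mean)
        self.std += self.step * (self.target_std - self.std)
        return True
```

A training loop would call `sample_task()` to configure each episode and `update()` after each batch of episodes, so the difficulty advances only as fast as the agent's measured performance allows.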

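Key design: The reward is shaped around dynamics specific to motorbikes. The network architecture is described only as adaptively tuned to optimize learning; no concrete parameter values are given in this summary.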
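As a rough illustration of this kind of reward shaping, the sketch below rewards track progress and subtracts penalties for unstable behavior. The specific penalty terms (lean rate, leaving the track, falling) and every weight are hypothetical; the source only states that instability-inducing behaviors are penalized.

```python
def shaped_reward(progress_delta: float, lean_rate: float,
                  off_track: bool, fell: bool,
                  w_progress: float = 1.0, w_lean_rate: float = 0.05,
                  off_track_penalty: float = 1.0,
                  fall_penalty: float = 10.0) -> float:
    """Hypothetical shaped reward; terms and weights are assumptions."""
    # Reward distance gained along the track since the previous step.
    reward = w_progress * progress_delta
    # Penalize abrupt changes in lean angle, which can destabilize the bike.
    reward -= w_lean_rate * abs(lean_rate)
    # Discrete penalties for leaving the track or crashing.
    if off_track:
        reward -= off_track_penalty
    if fell:
        reward -= fall_penalty
    return reward
```

The fall penalty is set much larger than the per-step progress reward here so that a crash is never worth a short burst of speed; the right ratio would have to be tuned in practice.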

📊 Experimental highlights

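In preliminary experiments, SPDL trains more efficiently than SAC alone and produces shorter lap times and more stable riding across multiple tracks and motorbike models. The results establish a first baseline for RL-based autonomous motorbike racing.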

🎯 Application scenarios

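Potential applications include self-driving motorbikes, virtual reality racing games, and robot control. Better autonomous riding capability could benefit future intelligent transportation systems and the entertainment industry.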

📄 Abstract (original)

Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.