DAPPER: Discriminability-Aware Policy-to-Policy Preference-Based Reinforcement Learning for Query-Efficient Robot Skill Acquisition
Authors: Yuki Kadokawa, Jonas Frey, Takahiro Miki, Takamitsu Matsubara, Marco Hutter
Category: cs.RO
Published: 2025-05-09 (Updated: 2026-01-21)
Note: Accepted for IEEE Robotics & Automation Magazine (RAM)
💡 One-Line Takeaway
Proposes DAPPER, which improves query efficiency in preference-based reinforcement learning by comparing trajectories from multiple policies and prioritizing highly discriminable queries.
🎯 Matched Areas: Pillar 1: Robot Control · Pillar 2: RL Algorithms & Architecture
Keywords: preference-based reinforcement learning, query efficiency, trajectory diversity, robot skill acquisition, human-robot interaction, discriminator, policy learning
📋 Core Points
- Existing preference-based RL methods suffer from low query efficiency: policy bias limits trajectory diversity, which reduces the number of discriminable queries available for preference learning (the sketch after this list shows the standard pairwise objective these methods share).
- The proposed DAPPER framework generates queries by comparing trajectories from multiple policies, increasing trajectory diversity and preference discriminability, and thereby query efficiency.
- Experiments show that DAPPER outperforms existing methods across a range of environments, with especially large query-efficiency gains when preference discriminability is low.
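For context, PbRL methods commonly fit a reward model to pairwise human labels with a Bradley-Terry likelihood. A minimal PyTorch sketch of that shared objective (the class and function names are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a state-action pair to a scalar reward estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def bradley_terry_loss(reward_model, seg_a, seg_b, label):
    """Standard pairwise preference loss used across PbRL methods.

    seg_a, seg_b: (obs, act) tuples of trajectory segments with shapes
    (batch, T, obs_dim) / (batch, T, act_dim); label is a float tensor,
    1.0 where the human preferred segment A.
    """
    ret_a = reward_model(*seg_a).sum(dim=-1)  # summed reward over segment
    ret_b = reward_model(*seg_b).sum(dim=-1)
    # P(A preferred) = exp(ret_a) / (exp(ret_a) + exp(ret_b))
    logits = ret_a - ret_b
    return nn.functional.binary_cross_entropy_with_logits(logits, label)
```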
📝 Abstract (Translated)
Preference-based Reinforcement Learning (PbRL) enables policy learning through simple comparison queries, but policy bias limits trajectory diversity, resulting in low query efficiency. This paper identifies preference discriminability as a key metric for improving query efficiency and generates queries by comparing trajectories from multiple policies, overcoming the limitations of a single policy. The DAPPER framework combines preference discriminability with trajectory diversification: it trains a new policy from scratch after each reward update and uses a discriminator to estimate preference discriminability, enabling prioritized sampling of more discriminable queries. Experiments show that DAPPER substantially improves query efficiency in simulated and real-world legged robot environments, particularly under challenging preference discriminability conditions.
🔬 Method Details
Problem definition: This work addresses the low query efficiency of preference-based RL. In existing methods, policy bias limits trajectory diversity, which reduces the number of discriminable queries available for preference learning.
Core idea: Instead of comparing trajectories from a single policy, DAPPER generates queries by comparing trajectories from multiple policies. This promotes trajectory diversity, reduces policy bias, and thereby improves preference discriminability.
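A minimal sketch of this policy-to-policy query construction, assuming a Gymnasium-style environment API; the pairing heuristic and function names are illustrative assumptions, not the paper's exact procedure:

```python
import itertools
import random

def rollout(policy, env, horizon=500):
    """Collect one trajectory (list of (obs, act) pairs) from a policy."""
    obs, _ = env.reset()
    traj = []
    for _ in range(horizon):
        act = policy(obs)
        traj.append((obs, act))
        obs, _, terminated, truncated, _ = env.step(act)
        if terminated or truncated:
            break
    return traj

def policy_to_policy_queries(policies, env, rollouts_per_policy=2,
                             pairs_per_round=10):
    """Build candidate queries by pairing trajectories from *different*
    policies (falling back to same-policy pairs when only one exists)."""
    trajs = [(i, rollout(p, env))
             for i, p in enumerate(policies)
             for _ in range(rollouts_per_policy)]
    pairs = [(ta, tb)
             for (i, ta), (j, tb) in itertools.combinations(trajs, 2)
             if i != j or len(policies) == 1]
    random.shuffle(pairs)
    return pairs[:pairs_per_round]
```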
Technical framework: DAPPER consists of several modules: (1) trajectories are generated from multiple policies; (2) a discriminator estimates the preference discriminability of candidate queries; (3) after each reward update, a new policy is trained from scratch, and the most discriminable queries are sampled with priority.
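Putting these modules together, the outer loop could look like the following sketch; `train_policy_from_scratch`, `update_reward_model`, the `discriminator.score` interface, and the `human` oracle are hypothetical stand-ins, and `policy_to_policy_queries` is the helper sketched above:

```python
def dapper_outer_loop(env, make_policy, reward_model, discriminator,
                      human, n_rounds=20, queries_per_round=10):
    """Outer loop sketch: fresh policies, discriminability-ranked queries."""
    policies = []
    for _ in range(n_rounds):
        # 1. Train a fresh policy from scratch under the current reward
        #    model, so each round adds trajectory diversity free of the
        #    bias accumulated by a single continually trained policy.
        policies.append(
            train_policy_from_scratch(make_policy(), env, reward_model))

        # 2. Build candidate queries across policies and rank them by the
        #    discriminator's estimated preference discriminability.
        candidates = policy_to_policy_queries(policies, env)
        candidates.sort(key=lambda q: discriminator.score(*q), reverse=True)

        # 3. Ask the human only about the most discriminable pairs.
        chosen = candidates[:queries_per_round]
        labels = [human.prefer(traj_a, traj_b) for traj_a, traj_b in chosen]

        # 4. Update the reward model on the new labels; the next round's
        #    policy is then retrained from scratch against it.
        update_reward_model(reward_model, chosen, labels)
```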
Key innovation: DAPPER couples preference discriminability with multi-policy trajectory diversification, substantially improving query efficiency. This differs fundamentally from conventional single-policy comparison methods and learns human preferences more effectively.
Key design: DAPPER trains a discriminator to estimate preference discriminability, and the policy objective jointly maximizes the preference reward and the preference discriminability score. Each policy is trained from scratch so that every update introduces fresh diversity.
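One plausible reading of this joint objective is a weighted sum of the learned preference reward and the discriminator's score used as the policy's training signal; in the sketch below the weight `lam` and the `score_step` interface are assumptions, not details from the paper:

```python
def shaped_reward(reward_model, discriminator, obs, act, lam=0.1):
    """Per-step training signal: learned preference reward plus a bonus
    for being distinguishable from previously trained policies.

    lam balances preference maximization against discriminability; its
    value here is an illustrative assumption.
    """
    r_pref = reward_model(obs, act)               # from the Bradley-Terry model
    r_disc = discriminator.score_step(obs, act)   # hypothetical interface
    return r_pref + lam * r_disc
```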
📊 Experiment Highlights
DAPPER significantly outperforms existing methods in query efficiency, with reported gains exceeding 30% when preference discriminability is low. These results are validated in both simulated and real-world legged robot environments, demonstrating the effectiveness of the approach.
🎯 Application Scenarios
DAPPER has broad potential in robot skill acquisition, especially in settings that require interaction with human users, such as service robots and autonomous driving. By improving query efficiency, it can learn behaviors aligned with human preferences more quickly, improving a robot's adaptability and practicality in complex environments. The approach could also extend to other intelligent systems that rely on human-in-the-loop collaboration.
📄 Abstract (Original)
Preference-based Reinforcement Learning (PbRL) enables policy learning through simple queries comparing trajectories from a single policy. While human responses to these queries make it possible to learn policies aligned with human preferences, PbRL suffers from low query efficiency, as policy bias limits trajectory diversity and reduces the number of discriminable queries available for learning preferences. This paper identifies preference discriminability, which quantifies how easily a human can judge which trajectory is closer to their ideal behavior, as a key metric for improving query efficiency. To address this, we move beyond comparisons within a single policy and instead generate queries by comparing trajectories from multiple policies, as training them from scratch promotes diversity without policy bias. We propose Discriminability-Aware Policy-to-Policy Preference-Based Efficient Reinforcement Learning (DAPPER), which integrates preference discriminability with trajectory diversification achieved by multiple policies. DAPPER trains new policies from scratch after each reward update and employs a discriminator that learns to estimate preference discriminability, enabling the prioritized sampling of more discriminable queries. During training, it jointly maximizes the preference reward and preference discriminability score, encouraging the discovery of highly rewarding and easily distinguishable policies. Experiments in simulated and real-world legged robot environments demonstrate that DAPPER outperforms previous methods in query efficiency, particularly under challenging preference discriminability conditions. A supplementary video that facilitates understanding of the proposed framework and its experimental results is available at: https://youtu.be/lRwX8FNN8n4