Residual Reward Models for Preference-based Reinforcement Learning
Authors: Chenyang Cao, Miguel Rogel-García, Mohamed Nabail, Xueqian Wang, Nicholas Rhinehart
Categories: cs.LG, cs.AI, cs.RO
Date: 2025-07-01
Comments: 26 pages, 22 figures
🔗 Code/Project: PROJECT_PAGE
💡 One-sentence takeaway
Proposes a residual reward model to address slow convergence in preference-based reinforcement learning
🎯 Matched area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)
Keywords: preference-based reinforcement learning, residual reward model, policy learning, inverse reinforcement learning, robot control, performance improvement
📋 Key points
- Existing preference-based RL methods can converge slowly, especially when a reward model must be trained.
- The proposed Residual Reward Model (RRM) splits the environment's true reward into a prior reward and a learned reward, effectively exploiting prior knowledge.
- Experiments show that RRM substantially improves performance across a range of tasks and, in particular, accelerates policy learning on a real robot.
📝 Abstract (translated)
Preference-based Reinforcement Learning (PbRL) offers a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding time-consuming reward design. However, PbRL can suffer from slow convergence because it requires training a reward model. This paper proposes a Residual Reward Model (RRM) that effectively leverages prior knowledge by splitting the environment's true reward into two parts: a prior reward and a learned reward. We evaluate state-based and image-based versions of RRM on the Meta-World environment suite; the results show that the method substantially improves PbRL performance, and it also achieves strong results on a real Franka Panda robot.
🔬 Method details
Problem definition: This work targets slow convergence in preference-based RL. Existing approaches that pre-train a neural reward model from demonstrations and then fine-tune it with preferences use different loss functions in the two stages, which can make optimization unstable.
Core idea: Propose the Residual Reward Model (RRM), which splits the true reward into a prior reward (e.g., a user's "best guess" reward function) and a learned reward trained from preferences, so existing knowledge is exploited effectively.
Framework: RRM has two main components: obtaining the prior reward and training the learned reward. The prior reward can come from sources such as inverse reinforcement learning, while the learned reward is trained from user preferences.
Key innovation: Decomposing the reward into a prior part and a learned part, which lets the model converge faster and improves training stability.
Key design: Several types of prior reward are considered, including proxy rewards and rewards obtained via inverse reinforcement learning; the loss function is also chosen carefully to keep optimization effective during training.
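The decomposition above can be sketched as a fixed prior term plus a learned residual fitted with a Bradley-Terry preference loss over trajectory segments. The snippet below is a minimal illustrative sketch with a linear learned reward, a toy goal-distance prior, and synthetic preference labels; all names and shapes are our own assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_reward(s):
    # Hypothetical "best guess" prior: negative distance to a goal at the origin.
    return -np.linalg.norm(s, axis=-1)

class ResidualReward:
    """Total reward = fixed prior + learned residual r_theta(s) = w . s (a sketch)."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def total(self, s):
        return prior_reward(s) + s @ self.w

    def segment_return(self, seg):
        # Preferences compare summed rewards over whole segments.
        return self.total(seg).sum()

    def update(self, seg_a, seg_b, pref_a, lr=0.1):
        # Bradley-Terry model: P(a > b) = sigmoid(R(a) - R(b)).
        # Gradient of the negative log-likelihood w.r.t. w only; the prior
        # term is constant in w, so it drops out of the gradient.
        d = self.segment_return(seg_a) - self.segment_return(seg_b)
        p_a = 1.0 / (1.0 + np.exp(-d))
        grad = (p_a - pref_a) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
        self.w -= lr * grad

# Toy preference data: segments with larger first-coordinate sums are preferred.
rrm = ResidualReward(dim=2)
for _ in range(200):
    seg_a = rng.normal(size=(5, 2))
    seg_b = rng.normal(size=(5, 2))
    pref_a = 1.0 if seg_a[:, 0].sum() > seg_b[:, 0].sum() else 0.0
    rrm.update(seg_a, seg_b, pref_a)

print(rrm.w)
```

Only the residual weights are updated; the prior shapes the total reward from step one, which is the mechanism the paper credits for faster convergence.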
📊 Experimental highlights
Experiments show that RRM substantially improves PbRL performance across several types of prior reward. On a real Franka Panda robot, policies trained with RRM succeed in notably fewer steps than the baseline, demonstrating accelerated learning.
🎯 Applications
Potential application areas include robot control, autonomous driving, and game AI, especially in complex environments where the reward signal is hard to specify. By accelerating policy learning, RRM can improve agent performance in real deployments, giving it clear practical value.
📄 Abstract (original)
Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence speed since it requires training in a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences. However, when the model is a neural network, using different loss functions for pre-training and fine-tuning can pose challenges to reliable optimization. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example, a user's ``best guess'' reward function, or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. Our method achieves performance improvements for a variety of different types of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot. It significantly accelerates policy learning for different tasks, achieving success in fewer steps than the baseline. The videos are presented at https://sunlighted.github.io/RRM-web/.