Residual Reward Models for Preference-based Reinforcement Learning
Authors: Chenyang Cao, Miguel Rogel-García, Mohamed Nabail, Xueqian Wang, Nicholas Rhinehart
Categories: cs.LG, cs.AI, cs.RO
Date: 2025-07-01
Comments: 26 pages, 22 figures
🔗 Code/Project: PROJECT_PAGE
💡 One-sentence takeaway
Proposes a residual reward model to address slow convergence in preference-based reinforcement learning
🎯 Matched area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)
Keywords: preference-based reinforcement learning, residual reward model, policy learning, inverse reinforcement learning, robot control, performance improvement
📋 Key points
- Existing preference-based RL methods can converge slowly, especially when a reward model must be trained.
- The proposed Residual Reward Model (RRM) splits the environment's true reward into a prior reward and a learned reward, effectively exploiting prior knowledge.
- Experiments show that RRM substantially improves performance across a range of tasks and, in particular, accelerates policy learning on a real robot.
📝 Abstract (translated)
Preference-based Reinforcement Learning (PbRL) offers a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding time-consuming reward design. However, PbRL can suffer from slow convergence because it requires training a reward model. This paper proposes a Residual Reward Model (RRM) that effectively leverages prior knowledge by splitting the environment's true reward into two parts: a prior reward and a learned reward. We evaluate state-based and image-based versions of RRM on the Meta-World environment suite; the results show that the method substantially improves PbRL performance, and it also achieves strong results on a real Franka Panda robot.
🔬 Method details
Problem definition: This work targets slow convergence in preference-based RL. Existing approaches that pre-train a neural reward model from demonstrations and then fine-tune it with preferences use different loss functions in the two stages, which can make optimization unstable.
Core idea: Propose the Residual Reward Model (RRM), which splits the true reward into a prior reward (e.g., a user's "best guess" reward function) and a learned reward trained from preferences, so existing knowledge is exploited effectively.
Framework: RRM has two main components: obtaining the prior reward and training the learned reward. The prior reward can come from sources such as inverse reinforcement learning, while the learned reward is trained from user preferences.
Key innovation: Decomposing the reward into a prior part and a learned part, which lets the model converge faster and improves training stability.
Key design: Several types of prior reward are considered, including proxy rewards and rewards obtained via inverse reinforcement learning; the loss function is also chosen carefully to keep optimization effective during training.
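The decomposition above can be sketched as a fixed prior term plus a learned residual fitted with a Bradley-Terry preference loss over trajectory segments. The snippet below is a minimal illustrative sketch with a linear learned reward, a toy goal-distance prior, and synthetic preference labels; all names and shapes are our own assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_reward(s):
    # Hypothetical "best guess" prior: negative distance to a goal at the origin.
    return -np.linalg.norm(s, axis=-1)

class ResidualReward:
    """Total reward = fixed prior + learned residual r_theta(s) = w . s (a sketch)."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def total(self, s):
        return prior_reward(s) + s @ self.w

    def segment_return(self, seg):
        # Preferences compare summed rewards over whole segments.
        return self.total(seg).sum()

    def update(self, seg_a, seg_b, pref_a, lr=0.1):
        # Bradley-Terry model: P(a > b) = sigmoid(R(a) - R(b)).
        # Gradient of the negative log-likelihood w.r.t. w only; the prior
        # term is constant in w, so it drops out of the gradient.
        d = self.segment_return(seg_a) - self.segment_return(seg_b)
        p_a = 1.0 / (1.0 + np.exp(-d))
        grad = (p_a - pref_a) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
        self.w -= lr * grad

# Toy preference data: segments with larger first-coordinate sums are preferred.
rrm = ResidualReward(dim=2)
for _ in range(200):
    seg_a = rng.normal(size=(5, 2))
    seg_b = rng.normal(size=(5, 2))
    pref_a = 1.0 if seg_a[:, 0].sum() > seg_b[:, 0].sum() else 0.0
    rrm.update(seg_a, seg_b, pref_a)

print(rrm.w)
```

Only the residual weights are updated; the prior shapes the total reward from step one, which is the mechanism the paper credits for faster convergence.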
📊 Experimental highlights
Experiments show that RRM substantially improves PbRL performance across several types of prior reward. On a real Franka Panda robot, policies trained with RRM succeed in notably fewer steps than the baseline, demonstrating accelerated learning.
🎯 Applications
Potential application areas include robot control, autonomous driving, and game AI, especially in complex environments where the reward signal is hard to specify. By accelerating policy learning, RRM can improve agent performance in real deployments, giving it clear practical value.
📄 Abstract (original)
Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence speed since it requires training in a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences. However, when the model is a neural network, using different loss functions for pre-training and fine-tuning can pose challenges to reliable optimization. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example, a user's ``best guess'' reward function, or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. Our method achieves performance improvements for a variety of different types of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot. It significantly accelerates policy learning for different tasks, achieving success in fewer steps than the baseline. The videos are presented at https://sunlighted.github.io/RRM-web/.