Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning

Authors: Reza Asad, Reza Babanezhad, Sharan Vaswani

Categories: cs.LG, cs.AI

Published: 2025-09-11

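💡 One-Sentence Summary

Decoupling the actor and critic entropy regularization improves actor-critic performance in discrete-action off-policy reinforcement learning.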

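🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: off-policy reinforcement learning, actor-critic methods, entropy regularization, discrete action spaces, Atari games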

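📋 Key Points

In discrete-action off-policy reinforcement learning, existing policy-based methods such as DSAC perform poorly, primarily because the actor and critic entropy terms are coupled.
The paper decouples the actor and critic entropy and introduces a flexible actor-critic framework that combines an m-step Bellman operator with standard policy optimization methods.
Experiments show that the resulting methods can approach the performance of DQN on Atari games, even without entropy regularization.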

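📝 Abstract (translated from Chinese)

In discrete-action environments such as Atari, value-based methods such as DQN are the usual choice for off-policy reinforcement learning. Common policy-based methods are either on-policy and cannot learn effectively from off-policy data (e.g. PPO), or perform poorly in discrete-action environments (e.g. SAC). Starting from discrete SAC (DSAC), this paper therefore revisits the design of actor-critic methods in this setting. First, we identify the coupling between the actor and critic entropy as the primary reason for DSAC's poor performance. We show that merely decoupling these components lets DSAC reach performance comparable to DQN. Motivated by this, we introduce a flexible off-policy actor-critic framework that contains DSAC as a special case. The framework allows an m-step Bellman operator for the critic update and can combine standard policy optimization methods with entropy regularization to instantiate the final actor objective. Theoretically, we prove that the proposed methods are guaranteed to converge to the optimal regularized value function in the tabular setting. Empirically, we show that these methods can approach the performance of DQN on standard Atari games, even without entropy regularization or explicit exploration.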

🔬 Method Details

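Problem definition: In discrete-action off-policy reinforcement learning, value-based methods such as DQN are the mainstream, while policy-based methods such as SAC perform poorly. The performance bottleneck of DSAC is the coupling between the actor entropy and the critic entropy, which prevents the policy from being learned effectively.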

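Core idea: Decouple the entropy regularization of the actor from that of the critic so that the two can be set and optimized independently. Decoupling removes the interference between the actor and the critic and thereby improves the efficiency and stability of policy learning. In addition, the paper introduces a more flexible actor-critic framework in which the critic may be updated with an m-step Bellman operator and the actor with standard policy optimization methods.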
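
To make the decoupling concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a DSAC-style formulation in which the critic target and the actor objective each carry their own entropy coefficient; the names `critic_entropy_coef` and `actor_entropy_coef` and all function names are hypothetical. Setting both coefficients to the same value corresponds to the coupled case.

```python
import numpy as np

def soft_state_value(q_values, probs, critic_entropy_coef):
    """V(s) = sum_a pi(a|s) * (Q(s,a) - coef * log pi(a|s)) for a batch of states."""
    log_probs = np.log(np.clip(probs, 1e-8, 1.0))  # clip to avoid log(0)
    return np.sum(probs * (q_values - critic_entropy_coef * log_probs), axis=-1)

def critic_target(reward, done, next_q, next_probs, gamma, critic_entropy_coef):
    """One-step target that uses the critic's own entropy coefficient (assumed form)."""
    v_next = soft_state_value(next_q, next_probs, critic_entropy_coef)
    return reward + gamma * (1.0 - done) * v_next

def actor_loss(q_values, probs, actor_entropy_coef):
    """Loss whose minimizer maximizes expected Q plus actor_entropy_coef * entropy."""
    log_probs = np.log(np.clip(probs, 1e-8, 1.0))
    per_state = np.sum(probs * (actor_entropy_coef * log_probs - q_values), axis=-1)
    return np.mean(per_state)
```

With `critic_entropy_coef=0.0` the critic target reduces to an expected-Q backup under the current policy, while the actor can still be regularized through `actor_entropy_coef`; which combination works best is an empirical question that the sketch does not answer.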

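Framework: The framework consists of an actor network and a critic network. The critic is updated with an m-step Bellman operator to improve the accuracy of its value estimates. The actor is updated by combining a standard policy optimization method with entropy regularization, balancing exploration and exploitation. The overall loop is: sample data from the experience replay buffer, evaluate the current policy with the critic, update the policy with the actor, and store new experience in the replay buffer.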
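
As an illustration of an m-step critic target, here is a minimal sketch under simplifying assumptions: it uses the plain uncorrected m-step return (no off-policy correction and no entropy bonus on intermediate steps), which may differ from the operator defined in the paper. The function name `m_step_target` and its arguments are hypothetical.

```python
def m_step_target(rewards, dones, bootstrap_value, gamma):
    """Compute sum_{k<m} gamma^k * r_{t+k} + gamma^m * V(s_{t+m}) for one trajectory slice.

    rewards, dones: sequences of length m (dones are 0.0 or 1.0).
    bootstrap_value: the critic's value estimate at the state reached after m steps.
    A done flag at step k cuts off everything that would follow step k.
    """
    target = bootstrap_value
    # Work backwards so that each step discounts the return that follows it.
    for reward, done in zip(reversed(rewards), reversed(dones)):
        target = reward + gamma * (1.0 - done) * target
    return target
```

With `m = 1` this reduces to the usual one-step target; larger `m` propagates reward information faster at the cost of relying more on the data-collecting policy.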

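Key innovation: The key innovations are decoupling the entropy regularization of the actor and the critic, and proposing a more general actor-critic framework. The framework allows the critic to be updated with an m-step Bellman operator and the actor to be updated by combining standard policy optimization methods with entropy regularization. This design makes the algorithm more flexible and adaptable to different tasks and environments.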

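Key design: The key design choices are: 1) separate entropy regularization coefficients for the actor and the critic; 2) an m-step Bellman operator for the critic update, where m is a tunable parameter; 3) an actor update that combines a standard policy optimization method (such as TRPO or PPO) with entropy regularization; 4) an experience replay buffer that stores transitions for off-policy learning. The loss functions are a mean-squared-error loss for the critic and a policy-gradient loss for the actor.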
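
The role of the actor's entropy coefficient can be seen in the tabular case, where maximizing expected Q plus an entropy bonus over the action simplex has the closed-form solution softmax(Q / coefficient). The sketch below illustrates that standard fact for a single state; it is not the paper's actor update, and the function name `entropy_regularized_policy` is hypothetical.

```python
import numpy as np

def entropy_regularized_policy(q_values, actor_entropy_coef):
    """Return argmax_pi sum_a pi(a) * (Q(a) - coef * log pi(a)) for one state."""
    q_values = np.asarray(q_values, dtype=float)
    if actor_entropy_coef <= 0.0:
        # Without entropy regularization the maximizer is the greedy policy.
        greedy = np.zeros_like(q_values)
        greedy[np.argmax(q_values)] = 1.0
        return greedy
    logits = q_values / actor_entropy_coef
    logits = logits - np.max(logits)  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / np.sum(weights)
```

A larger coefficient spreads probability across actions, while a coefficient of zero collapses the policy onto the highest-valued action.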

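📊 Experimental Highlights

The experiments show that, with the actor and critic entropy regularization decoupled and the proposed flexible actor-critic framework in place, the method can approach the performance of DQN on Atari games. Notably, it performs well even without entropy regularization or explicit exploration, suggesting that it is robust to the absence of these components.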

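🎯 Application Scenarios

The results apply to discrete-action control tasks such as game AI, robot control, and recommender systems. Better off-policy reinforcement learning algorithms make it possible to train agents more efficiently so that they make better decisions in complex environments. The approach is particularly valuable when resources are limited or data collection is expensive.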

📄 Abstract (original)

Value-based approaches such as DQN are the default methods for off-policy reinforcement learning with discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not effectively learn from off-policy data (e.g. PPO), or have poor empirical performance in the discrete-action setting (e.g. SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that by merely decoupling these components, DSAC can have comparable performance as DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods can guarantee convergence to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration.
