Sample Complexity of Average-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning

作者: Yuchen Jiao, Jiin Woo, Gen Li, Gauri Joshi, Yuejie Chi

分类: stat.ML, cs.LG

发布日期: 2026-01-20

💡 一句话要点

提出平均奖励Q学习算法以解决样本复杂度问题

🎯 匹配领域: 支柱二：RL算法与架构 (RL & Architecture)

关键词: 平均奖励 Q学习 样本复杂度 联邦学习 马尔可夫决策过程 强化学习 通信复杂度

📋 核心要点

现有的Q学习算法在平均奖励设置下的理论保证有限，尤其是在样本复杂度方面存在挑战。
本研究提出了一种新的Q学习算法，针对平均奖励MDP进行了优化，适用于单代理和联邦学习场景。
实验结果表明，所提算法在样本复杂度和通信复杂度上均有显著提升，尤其是在联邦设置中表现优异。

📝 摘要（中文）

平均奖励强化学习为长期决策提供了一个原则性框架，通过最大化每个时间步的平均奖励来实现。尽管Q学习是一种广泛使用的无模型算法，但在平均奖励设置下的理论保证仍然有限。本研究探讨了一种简单而有效的Q学习算法，适用于有限状态和动作空间的平均奖励马尔可夫决策过程（MDP），涵盖单代理和联邦场景。我们证明了在单代理情况下，经过精心选择的参数，Q学习的样本复杂度为$ ilde{O}ig(rac{| ext{S}|| ext{A}|ig\|h^{ ext{}}\big\|_{ ext{sp}}^3}{ ext{ε}^3}ig)$，在联邦设置中，合作将每个代理的样本复杂度降低到$ ilde{O}ig(rac{| ext{S}|| ext{A}|ig\|h^{ ext{}}\big\|{ ext{sp}}^3}{M ext{ε}^3}ig)$，且仅需$ ilde{O}ig(rac{ig\|h^{ ext{*}}\big\|{ ext{sp}}}{ ext{ε}}ig)$次通信轮次。这些结果建立了首个针对平均奖励MDP的联邦Q学习算法，具有可证明的样本和通信复杂度效率。

🔬 方法详解

问题定义：本论文旨在解决平均奖励Q学习在样本复杂度方面的不足，尤其是在有限状态和动作空间的马尔可夫决策过程（MDP）中，现有方法缺乏有效的理论支持。

核心思路：论文提出了一种简单而有效的Q学习算法，通过精心选择的参数来优化样本复杂度，适用于单代理和联邦学习场景，强调了合作学习的优势。

技术框架：整体架构包括单代理和联邦两种场景，采用了弱通信假设，主要模块包括Q值更新、样本收集和通信机制。

关键创新：本研究的主要创新在于首次提出了针对平均奖励MDP的联邦Q学习算法，显著降低了每个代理的样本复杂度，并提供了理论保证。

关键设计：在算法设计中，选择了合适的参数设置以优化样本复杂度，并设计了有效的通信机制，确保在联邦学习中仅需少量通信轮次。具体参数包括$ ext{ε}$和$ig\|h^{ ext{*}}\big\|_{ ext{sp}}$的选择。

📊 实验亮点

实验结果显示，所提算法在单代理情况下的样本复杂度为$ ilde{O}ig(rac{| ext{S}|| ext{A}|ig\|h^{ ext{}}\big\|_{ ext{sp}}^3}{ ext{ε}^3}ig)$，在联邦设置中，样本复杂度降低至$ ilde{O}ig(rac{| ext{S}|| ext{A}|ig\|h^{ ext{}}\big\|{ ext{sp}}^3}{M ext{ε}^3}ig)$，且通信轮次仅需$ ilde{O}ig(rac{ig\|h^{ ext{*}}\big\|{ ext{sp}}}{ ext{ε}}ig)$，显示出显著的效率提升。

🎯 应用场景

该研究的潜在应用领域包括智能交通系统、个性化推荐系统和多智能体协作任务等。通过优化样本和通信复杂度，能够在资源受限的环境中实现高效的决策制定，具有重要的实际价值和广泛的应用前景。

📄 摘要（原文）

Average-reward reinforcement learning offers a principled framework for long-term decision-making by maximizing the mean reward per time step. Although Q-learning is a widely used model-free algorithm with established sample complexity in discounted and finite-horizon Markov decision processes (MDPs), its theoretical guarantees for average-reward settings remain limited. This work studies a simple but effective Q-learning algorithm for average-reward MDPs with finite state and action spaces under the weakly communicating assumption, covering both single-agent and federated scenarios. For the single-agent case, we show that Q-learning with carefully chosen parameters achieves sample complexity $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|{\mathsf{sp}}^3}{\varepsilon^3}\right)$, where $\|h^{\star}\|{\mathsf{sp}}$ is the span norm of the bias function, improving previous results by at least a factor of $\frac{\|h^{\star}\|{\mathsf{sp}}^2}{\varepsilon^2}$. In the federated setting with $M$ agents, we prove that collaboration reduces the per-agent sample complexity to $\widetilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|\|h^{\star}\|{\mathsf{sp}}^3}{M\varepsilon^3}\right)$, with only $\widetilde{O}\left(\frac{\|h^{\star}\|_{\mathsf{sp}}}{\varepsilon}\right)$ communication rounds required. These results establish the first federated Q-learning algorithm for average-reward MDPs, with provable efficiency in both sample and communication complexity.

Sample Complexity of Average-Reward Q-Learning: From Single-agent to Federated Reinforcement Learning

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理