Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

Authors: Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, Goran Radanović

Categories: cs.LG

Published: 2026-03-30

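💡 One-Sentence Summary

Proposes corruption-robust algorithms for offline multi-agent reinforcement learning from human feedback (MARLHF).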

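🎯 Matched Area: Pillar 2: RL Algorithms & Architecture

Keywords: multi-agent reinforcement learning, data corruption, robustness, linear Markov games, coarse correlated equilibrium, algorithm optimization, offline learning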

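📋 Key Points

Offline MARLHF has lacked a systematic treatment of adversarial data corruption; the paper studies a strong-contamination model in which an ε-fraction of the trajectory-preference samples may be arbitrarily corrupted.
The problem is modeled with linear Markov games, and under uniform coverage the paper introduces a robust estimator that guarantees an O(ε^{1 - o(1)}) bound on the Nash equilibrium gap.
Under unilateral coverage, the proposed algorithm achieves an O(√ε) Nash gap. Because both Nash procedures are computationally intractable, relaxing the solution concept to coarse correlated equilibria (CCE) yields a quasi-polynomial-time algorithm with an O(√ε) CCE gap.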

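📝 Abstract (Summary)

The paper studies the robustness of offline multi-agent reinforcement learning from human feedback (MARLHF) to data corruption under a strong-contamination model. Given a dataset D of trajectory-preference tuples, where each preference is an n-dimensional binary label vector, an ε-fraction of the samples may be arbitrarily corrupted. The problem is modeled in the framework of linear Markov games. Under a uniform coverage assumption, the paper proposes a robust estimator that guarantees a Nash equilibrium gap of O(ε^{1 - o(1)}). In the more challenging unilateral coverage setting, the proposed algorithm achieves a Nash gap of O(√ε). To address computational intractability, the solution concept is relaxed to coarse correlated equilibria (CCE), and under the same unilateral coverage condition the paper derives a quasi-polynomial-time algorithm whose CCE gap is O(√ε). According to the authors, this is the first systematic treatment of adversarial data corruption in offline MARLHF.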

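🔬 Method Details

Problem definition: The paper asks how to learn approximate equilibria in offline MARLHF when the preference data is corrupted. The dataset D consists of trajectory-preference tuples, each preference being an n-dimensional binary label vector with one entry per agent, and an ε-fraction of the samples may be arbitrarily corrupted. The authors describe their work as the first systematic treatment of this setting.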
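The data model in the problem definition can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the paper's construction: the `PreferenceSample` layout (two trajectories summarized by feature tuples), the random clean labels, and the label-flipping adversary, which is just one of many arbitrary corruptions the strong-contamination model allows.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PreferenceSample:
    """One trajectory-preference tuple (hypothetical layout)."""
    traj_a: Tuple[float, ...]  # feature summary of the first trajectory
    traj_b: Tuple[float, ...]  # feature summary of the second trajectory
    labels: Tuple[int, ...]    # n binary labels, one per agent


def make_clean_dataset(num_samples: int, n_agents: int, dim: int,
                       rng: random.Random) -> List[PreferenceSample]:
    """Draw a synthetic clean dataset with random features and labels."""
    data = []
    for _ in range(num_samples):
        a = tuple(rng.uniform(-1.0, 1.0) for _ in range(dim))
        b = tuple(rng.uniform(-1.0, 1.0) for _ in range(dim))
        labels = tuple(rng.randint(0, 1) for _ in range(n_agents))
        data.append(PreferenceSample(a, b, labels))
    return data


def corrupt(data: List[PreferenceSample], eps: float,
            rng: random.Random) -> List[PreferenceSample]:
    """Replace floor(eps * N) samples; here the adversary flips every label.

    A real adversary may inspect the clean data and substitute anything,
    including the trajectories themselves.
    """
    budget = int(eps * len(data))
    corrupted = list(data)
    for i in rng.sample(range(len(data)), budget):
        s = data[i]
        flipped = tuple(1 - y for y in s.labels)
        corrupted[i] = PreferenceSample(s.traj_a, s.traj_b, flipped)
    return corrupted


if __name__ == "__main__":
    rng = random.Random(0)
    clean = make_clean_dataset(num_samples=200, n_agents=3, dim=4, rng=rng)
    dirty = corrupt(clean, eps=0.1, rng=rng)
    changed = sum(c != d for c, d in zip(clean, dirty))
    print(f"{changed} of {len(clean)} samples were corrupted")
```

The learner only ever sees the corrupted list and does not know which indices were touched, which is what makes the setting adversarial.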

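Core idea: Under uniform coverage, a robust estimator bounds the Nash equilibrium gap by O(ε^{1 - o(1)}). Under the weaker unilateral coverage condition, the proposed algorithm bounds the Nash gap by O(√ε). Relaxing the target from Nash equilibria to coarse correlated equilibria (CCE) keeps the O(√ε) rate for the CCE gap while making the computation quasi-polynomial.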
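To make the two gap notions concrete, here is a minimal sketch for a one-shot two-player matrix game, which is a much simpler object than the Markov games in the paper. The function names are hypothetical, and taking the maximum over agents (rather than, say, the sum) is an assumption about the convention, not something the abstract states.

```python
import numpy as np


def nash_gap(A: np.ndarray, B: np.ndarray,
             x: np.ndarray, y: np.ndarray) -> float:
    """Largest unilateral gain against the product strategy (x, y).

    A and B are the payoff matrices of the row and column player.
    """
    v1 = x @ A @ y                 # row player's expected payoff
    v2 = x @ B @ y                 # column player's expected payoff
    gain1 = np.max(A @ y) - v1     # best fixed row against y
    gain2 = np.max(x @ B) - v2     # best fixed column against x
    return float(max(gain1, gain2))


def cce_gap(A: np.ndarray, B: np.ndarray, mu: np.ndarray) -> float:
    """Largest gain from ignoring a joint distribution mu over action pairs.

    The deviating player commits to one fixed action while the opponent
    still follows its marginal of mu. A value <= 0 means mu is a CCE.
    """
    v1 = np.sum(mu * A)
    v2 = np.sum(mu * B)
    col_marginal = mu.sum(axis=0)  # distribution of the column action
    row_marginal = mu.sum(axis=1)  # distribution of the row action
    gain1 = np.max(A @ col_marginal) - v1
    gain2 = np.max(row_marginal @ B) - v2
    return float(max(gain1, gain2))


if __name__ == "__main__":
    # Matching pennies: the uniform product strategy is the Nash equilibrium.
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    uniform = np.array([0.5, 0.5])
    print(nash_gap(A, -A, uniform, uniform))
    print(cce_gap(A, -A, np.outer(uniform, uniform)))
```

Every Nash equilibrium, viewed as a product distribution, is also a CCE, so the CCE gap is the easier quantity to drive down; that is consistent with the relaxation giving a more tractable algorithm.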

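Technical framework: The analysis is built on linear Markov games and proceeds through three results: (1) a robust estimator for the Nash gap under uniform coverage, where every policy of interest is sufficiently represented in the clean data; (2) an algorithm for the Nash gap under unilateral coverage, where only a Nash equilibrium and its single-player deviations are covered; and (3) a quasi-polynomial-time algorithm for the CCE gap under the same unilateral coverage condition.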

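Key innovation: According to the authors, this is the first systematic treatment of adversarial data corruption in offline MARLHF, together with robust estimators and algorithms that come with explicit bounds on the equilibrium gap.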

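Key design: The central quantity is the corruption fraction ε, which controls all three bounds. The coverage assumption determines the achievable rate: O(ε^{1 - o(1)}) under uniform coverage versus O(√ε) under unilateral coverage. The choice of solution concept determines tractability: both Nash procedures are computationally intractable, whereas the CCE relaxation admits a quasi-polynomial-time algorithm. The estimators rely on the linear structure of linear Markov games.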
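For readers unfamiliar with the term, the linear structure commonly assumed in the literature on linear Markov games can be written as below. The feature map $\phi$, the parameters $\theta_{i,h}$, and the measures $\mu_h$ are notation introduced here for illustration; the paper's exact assumptions may differ.

```latex
% Agent i's reward and the transition kernel at step h are both linear
% in a known d-dimensional feature map of the state and joint action.
r_{i,h}(s, \mathbf{a}) = \langle \phi(s, \mathbf{a}),\, \theta_{i,h} \rangle,
\qquad
P_h(s' \mid s, \mathbf{a}) = \langle \phi(s, \mathbf{a}),\, \mu_h(s') \rangle,
\qquad
\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d .
```

Here $\mathbf{a}$ denotes the joint action of all $n$ agents, so the unknowns reduce to finite-dimensional parameters that a robust estimator can target.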

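📊 Result Highlights

The reported results are theoretical guarantees rather than empirical comparisons. Under uniform coverage, the robust estimator guarantees an O(ε^{1 - o(1)}) Nash gap. Under unilateral coverage, the proposed algorithm achieves an O(√ε) Nash gap, and the CCE relaxation achieves an O(√ε) CCE gap in quasi-polynomial time.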
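A quick numerical comparison shows how different the two rates are for small ε. This is only arithmetic on the stated orders: the hidden constants are ignored, and `delta` is a hypothetical fixed stand-in for the o(1) term in the exponent, which the summary does not specify.

```python
import math


def rate_uniform(eps: float, delta: float = 0.1) -> float:
    """eps ** (1 - delta), with delta standing in for the o(1) term."""
    return eps ** (1.0 - delta)


def rate_unilateral(eps: float) -> float:
    """sqrt(eps), the order of the Nash and CCE gaps under unilateral coverage."""
    return math.sqrt(eps)


if __name__ == "__main__":
    for eps in (0.1, 0.01, 0.001):
        print(f"eps={eps:<6} uniform~{rate_uniform(eps):.4f} "
              f"unilateral~{rate_unilateral(eps):.4f}")
```

For any ε in (0, 1) and any `delta` below 0.5, the uniform-coverage rate is the smaller of the two, which matches the description of unilateral coverage as the more challenging setting.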

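🎯 Application Scenarios

The work is potentially relevant to multi-agent systems, robot collaboration, and intelligent transportation. Making offline learning more robust to corrupted preference data could help such systems cope with unreliable real-world data and improve their overall reliability.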

📄 Abstract (Original)

We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents' preferences), an $ε$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption - where every policy of interest is sufficiently represented in the clean (prior to corruption) data - we introduce a robust estimator that guarantees an $O(ε^{1 - o(1)})$ bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an $O(\sqrt{ε})$ bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{ε})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
