Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis

Authors: Caoyun Fan, Jindou Chen, Yaohui Jin, Hao He

Categories: cs.AI, cs.CL, cs.GT

Published: 2023-12-09 (updated: 2023-12-12)

Comments: AAAI 2024

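💡 One-Sentence Takeaway

A systematic analysis of how rational large language models are in game theory, revealing the gap between them and humans.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, game theory, rationality, social science, behavior simulation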

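📋 Key Points

Existing research has not established the capability boundaries of LLMs in game theory and lacks a systematic analysis.
This study evaluates LLMs in game theory by analyzing how rational they are in three respects: building desires, refining beliefs, and taking actions.
Experiments show that even GPT-4 differs substantially from humans in game theory, so LLMs should be applied to game experiments with caution.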

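📝 Abstract (translation)

Game theory is often used as a tool for analyzing human behavior in social science research. Because the behavior of large language models (LLMs) aligns closely with that of humans, one promising research direction is to use LLMs in place of humans in game experiments, thereby advancing social science research. However, although there are many empirical studies combining LLMs and game theory, the capability boundaries of LLMs in game theory remain unclear. This study systematically analyzes how LLMs perform in game theory. Specifically, rationality is the fundamental principle of game theory and the standard for evaluating players' behavior: building a clear desire, refining beliefs about uncertainty, and taking optimal actions. We therefore select three classic games (the dictator game, Rock-Paper-Scissors, and the ring-network game) to analyze how far LLMs can achieve rationality in these three respects. The experimental results show that even the current state-of-the-art LLM (GPT-4) differs substantially from humans in game theory. For example, LLMs struggle to build desires based on uncommon preferences, fail to refine beliefs from many simple patterns, and may overlook or modify their refined beliefs when taking actions. We therefore argue that introducing LLMs into game experiments in the social sciences calls for greater caution.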

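🔬 Method Details

Problem definition: The paper aims to evaluate how rational large language models (LLMs) are in game theory. Existing work lacks a systematic analysis of the capability boundaries of LLMs in game theory, so it cannot determine whether LLMs can reliably stand in for humans in game experiments. This hinders the application of LLMs in social science research.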

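Core idea: The paper decomposes the rationality principle of game theory into three key aspects: building a clear desire, refining beliefs about uncertainty, and taking optimal actions. By analyzing how LLMs perform in each aspect, it evaluates how rational they are in game theory. This decomposition gives a finer-grained picture of the strengths and limitations of LLMs in game theory.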
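To make the three-part decomposition concrete, here is a minimal sketch of an idealized rational player for Rock-Paper-Scissors. It is an illustration, not the paper's code: the payoff values (+1, 0, -1), the frequency-based belief with add-one smoothing, and all function names are assumptions chosen for clarity.

```python
from collections import Counter

ACTIONS = ("rock", "paper", "scissors")
# BEATS[a] is the action that a defeats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine, theirs):
    """Desire: an assumed payoff of +1 for a win, 0 for a tie, -1 for a loss."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def refine_belief(history):
    """Belief: empirical frequency of the opponent's past moves.

    Add-one smoothing (an assumption) keeps every action possible
    even when the history is short or empty.
    """
    counts = Counter(history)
    total = len(history) + len(ACTIONS)
    return {a: (counts[a] + 1) / total for a in ACTIONS}


def best_action(belief):
    """Action: pick the move with the highest expected payoff under the belief."""
    def expected(mine):
        return sum(p * payoff(mine, theirs) for theirs, p in belief.items())
    return max(ACTIONS, key=expected)


if __name__ == "__main__":
    opponent_history = ["rock"] * 5  # a simple constant pattern
    belief = refine_belief(opponent_history)
    print(belief, best_action(belief))
```

Keeping desire, belief, and action in separate functions mirrors the decomposition: a player can fail at any one step independently, for example by forming a correct belief and then not acting on it.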

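Technical framework: The paper uses three classic games as its experimental platform: the dictator game, Rock-Paper-Scissors, and the ring-network game. For each game, a dedicated prompt is designed to guide the LLM through play. The LLM's behavior in each game is then analyzed to evaluate how well it builds desires, refines beliefs, and takes actions. The results are compared with human behavior to assess how rational the LLMs are.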
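The prompt-then-analyze loop described above could be organized roughly as follows. This is a hedged sketch, not the authors' harness: the prompt wording, the function names, and `stub_model` are all hypothetical, and a real experiment would replace `stub_model` with a call to an actual LLM.

```python
import re
from typing import Callable, Optional

Allocation = tuple[int, int]  # (dictator payoff, recipient payoff)


def build_prompt(preference: str, options: list[Allocation]) -> str:
    """Render a dictator-game prompt (the wording is a hypothetical example)."""
    lines = [
        f"You are the dictator in a dictator game. Your preference: {preference}.",
        "Choose exactly one allocation:",
    ]
    for i, (mine, theirs) in enumerate(options, start=1):
        lines.append(f"Option {i}: you get {mine}, the other player gets {theirs}")
    lines.append("Answer with the option number only.")
    return "\n".join(lines)


def parse_choice(reply: str, n_options: int) -> Optional[int]:
    """Return the 0-based index of the chosen option, or None if unparseable."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    index = int(match.group()) - 1
    return index if 0 <= index < n_options else None


def run_trials(model: Callable[[str], str], preference: str,
               options: list[Allocation], n_trials: int) -> list[Optional[int]]:
    """Query the model repeatedly and collect its parsed choices."""
    prompt = build_prompt(preference, options)
    return [parse_choice(model(prompt), len(options)) for _ in range(n_trials)]


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call; it always answers with the second option."""
    return "Option 2"


if __name__ == "__main__":
    print(run_trials(stub_model, "equality", [(300, 300), (400, 200)], 3))
```

Returning `None` for unparseable replies, rather than guessing, keeps invalid answers visible so they can be counted separately from wrong answers.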

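Key innovation: The paper's most important technical contribution is a framework for systematically analyzing how rational LLMs are in game theory. The framework decomposes rationality into three aspects that can each be evaluated, and designs corresponding experiments to measure LLM performance. Compared with earlier empirical studies, it gives a deeper view of the capability boundaries of LLMs in game theory.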

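Key design: In the dictator game, the LLM's preference setting is varied (for example, by assigning different utility functions) to test its ability to build desires. In Rock-Paper-Scissors, the LLM's choice patterns over repeated rounds are analyzed to evaluate its ability to refine beliefs. In the ring-network game, the LLM's strategy choices within the network structure are analyzed to evaluate its ability to take optimal actions. Prompt design is critical: the prompts must guide the LLM through the game clearly and capture its behavior accurately.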
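As an illustration of testing desire-building with different utility functions, the sketch below scores a set of recorded dictator-game choices against the allocation that maximizes a given utility. The four preference names and their utility formulas are assumptions made for this example, not the paper's definitions, and the recorded choices are invented placeholders rather than experimental data.

```python
from typing import Callable

Allocation = tuple[int, int]  # (dictator payoff, recipient payoff)

# Hypothetical utility functions, one per assumed preference.
UTILITIES: dict[str, Callable[[Allocation], float]] = {
    "self_interest": lambda a: a[0],             # maximize own payoff
    "altruism": lambda a: a[1],                  # maximize the other's payoff
    "equality": lambda a: -abs(a[0] - a[1]),     # minimize the payoff gap
    "common_interest": lambda a: a[0] + a[1],    # maximize the total payoff
}


def rational_choice(preference: str, options: list[Allocation]) -> int:
    """Index of the allocation that maximizes utility (first one on ties)."""
    utility = UTILITIES[preference]
    return max(range(len(options)), key=lambda i: utility(options[i]))


def agreement_rate(preference: str,
                   trials: list[tuple[list[Allocation], int]]) -> float:
    """Fraction of trials where the recorded choice equals the rational one."""
    hits = sum(
        1 for options, chosen in trials
        if chosen == rational_choice(preference, options)
    )
    return hits / len(trials)


if __name__ == "__main__":
    options = [(300, 300), (500, 200)]
    # Placeholder choices for illustration only, not real model output.
    trials = [(options, 0), (options, 1)]
    for name in UTILITIES:
        print(name, rational_choice(name, options), agreement_rate(name, trials))
```

Comparing the same recorded choices against several preferences shows which desire, if any, the observed behavior is consistent with.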

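📊 Experimental Highlights

The experimental results show that even GPT-4 differs substantially from humans in game theory. For example, LLMs struggle to build desires based on uncommon preferences, fail to refine beliefs from many simple patterns, and may overlook or modify their refined beliefs when taking actions. These findings underline the need for caution when introducing LLMs into game experiments in the social sciences.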

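🎯 Application Scenarios

The findings can be used to evaluate and improve the ability of LLMs to simulate human behavior, especially in social science, economics, and behavioral science. Understanding the limitations of LLMs in game theory allows them to be applied more carefully in social simulation and decision-support systems. Future work could explore how to make LLMs more rational so that they can more reliably stand in for humans in game experiments.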

📄 Abstract (original)

Game theory, as an analytical tool, is frequently utilized to analyze human behavior in social science research. With the high alignment between the behavior of Large Language Models (LLMs) and humans, a promising research direction is to employ LLMs as substitutes for humans in game experiments, enabling social science research. However, despite numerous empirical researches on the combination of LLMs and game theory, the capability boundaries of LLMs in game theory remain unclear. In this research, we endeavor to systematically analyze LLMs in the context of game theory. Specifically, rationality, as the fundamental principle of game theory, serves as the metric for evaluating players' behavior -- building a clear desire, refining belief about uncertainty, and taking optimal actions. Accordingly, we select three classical games (dictator game, Rock-Paper-Scissors, and ring-network game) to analyze to what extent LLMs can achieve rationality in these three aspects. The experimental results indicate that even the current state-of-the-art LLM (GPT-4) exhibits substantial disparities compared to humans in game theory. For instance, LLMs struggle to build desires based on uncommon preferences, fail to refine belief from many simple patterns, and may overlook or modify refined belief when taking actions. Therefore, we consider that introducing LLMs into game experiments in the field of social science should be approached with greater caution.