Helmsman of the Masses? Evaluate the Opinion Leadership of Large Language Models in the Werewolf Game

📄 arXiv: 2404.01602v2

Authors: Silin Du, Xiaowei Zhang

Categories: cs.CL, cs.AI, cs.HC

Published: 2024-04-02 (updated: 2024-08-29)

Comments: Published as a conference paper at COLM 2024. 37 pages, 6 figures, 27 tables


💡 One-sentence takeaway

Evaluates the opinion leadership of large language models in the Werewolf game

🎯 Matched area: Pillar 9: Embodied Foundation Models

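Keywords: large language models, opinion leadership, Werewolf game, social deduction, multi-agent interaction, human-AI collaboration, evaluation metrics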

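📋 Key points

  1. Existing work pays little attention to the opinion leadership of LLM-based agents in multi-agent and human-AI interaction, which limits their practical application.
  2. The paper builds an evaluation framework around the Sheriff role in the Werewolf game and designs two new metrics to measure the opinion leadership of LLMs.
  3. Experiments indicate that the Werewolf game is a suitable test bed for evaluating opinion leadership, and that most LLMs have limited ability in this respect.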

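📝 Abstract (translated from Chinese)

Large language models (LLMs) have exhibited memorable strategic behaviors in social deduction games, but their role as opinion leaders has been largely overlooked. Opinion leaders have a noticeable impact on the beliefs and behaviors of others within a social group. This paper uses the Werewolf game as a simulation platform to assess the opinion leadership of LLMs. The authors develop a framework that integrates the game's Sheriff role and design two novel metrics that measure the reliability and the influence of the opinion leader. Through extensive experiments they evaluate LLMs of different scales, and they collect a Werewolf question-answering dataset (WWQA) to assess and enhance the LLMs' grasp of the game rules. The results suggest that the Werewolf game is a suitable test bed for evaluating the opinion leadership of LLMs, and that few LLMs possess the capacity for opinion leadership.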

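🔬 Method details

Problem definition: The paper addresses the lack of evaluation of LLM opinion leadership in social deduction games. Existing approaches have not adequately examined how much influence LLMs exert in multi-agent interaction.

Core idea: Use the Sheriff role in the Werewolf game as a stand-in for an opinion leader, and quantify the opinion leadership of LLMs with two metrics: reliability and influence.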

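Technical framework: The framework has three main parts: the design of the game roles, the definition of the metrics, and the experimental evaluation. The Sheriff summarizes the players' arguments and recommends decision options, and therefore serves as a proxy for an opinion leader.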
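The role of the Sheriff described above can be illustrated with a minimal sketch of one day phase. Everything here is an assumption made for illustration: the `Player` interface, the `run_day_round` helper, the rule that the Sheriff speaks last, and the simple plurality vote are hypothetical and are not taken from the paper's implementation. The agents are deterministic stubs rather than real LLM calls.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical agent interface: each callable maps the transcript so far
# to a statement (speak) or to the name of the player voted for (vote).
Speak = Callable[[List[str]], str]
Vote = Callable[[List[str]], str]


@dataclass
class Player:
    name: str
    speak: Speak
    vote: Vote


def run_day_round(players: List[Player], sheriff: Player) -> Dict[str, object]:
    """One simplified day phase: discussion, Sheriff summary, then voting."""
    transcript: List[str] = []
    # Assumption: ordinary players speak first, in list order.
    for p in players:
        if p is not sheriff:
            transcript.append(f"{p.name}: {p.speak(transcript)}")
    # Assumption: the Sheriff speaks last, summarizing and recommending.
    transcript.append(f"{sheriff.name} (Sheriff): {sheriff.speak(transcript)}")
    # Every player, including the Sheriff, then casts one vote.
    votes = {p.name: p.vote(transcript) for p in players}
    eliminated, _ = Counter(votes.values()).most_common(1)[0]
    return {"transcript": transcript, "votes": votes, "eliminated": eliminated}


def make_stub(name: str, statement: str, target: str) -> Player:
    """Build a stub agent that ignores the transcript (stands in for an LLM)."""
    return Player(name, lambda _t: statement, lambda _t: target)


players = [
    make_stub("P1", "I suspect P3.", "P3"),
    make_stub("P2", "Summary: P1 suspects P3. I recommend voting out P3.", "P3"),
    make_stub("P3", "I am a villager.", "P1"),
]
result = run_day_round(players, sheriff=players[1])
```

In a real setup the stubs would be replaced by LLM-backed agents whose vote depends on the transcript, which is what makes the Sheriff's summary a possible channel of influence.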

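Key innovation: Two novel metrics grounded in the critical characteristics of opinion leaders. The first measures the reliability of the opinion leader; the second measures the opinion leader's influence on other players' decisions.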
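This summary does not give the formulas for the two metrics, so the sketch below uses simple stand-ins to show what "reliability" and "influence" could look like when computed from game logs. The `RoundLog` record and both proxy functions are hypothetical. They assume that each player's intended vote is recorded before and after the Sheriff speaks. They are not the paper's definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class RoundLog:
    """Hypothetical log of one voting round (not the paper's data format)."""
    recommendation: str           # player the Sheriff recommends voting out
    votes_before: Dict[str, str]  # intended votes before the Sheriff speaks
    votes_after: Dict[str, str]   # final votes after the Sheriff speaks


def reliability_proxy(rounds: List[RoundLog], werewolves: Set[str]) -> float:
    """Share of Sheriff recommendations that point at an actual werewolf."""
    if not rounds:
        return 0.0
    hits = sum(r.recommendation in werewolves for r in rounds)
    return hits / len(rounds)


def influence_proxy(rounds: List[RoundLog]) -> float:
    """Among players who first disagreed with the Sheriff, share who switched."""
    switched = disagreed = 0
    for r in rounds:
        for player, before in r.votes_before.items():
            if before != r.recommendation:
                disagreed += 1
                if r.votes_after.get(player) == r.recommendation:
                    switched += 1
    return switched / disagreed if disagreed else 0.0


# Toy data: P3 is the only werewolf; the Sheriff is right once out of twice.
rounds = [
    RoundLog("P3",
             {"P1": "P3", "P4": "P5", "P5": "P4"},
             {"P1": "P3", "P4": "P3", "P5": "P4"}),
    RoundLog("P4",
             {"P1": "P5", "P5": "P1"},
             {"P1": "P4", "P5": "P4"}),
]
reliability = reliability_proxy(rounds, werewolves={"P3"})  # 0.5
influence = influence_proxy(rounds)                         # 0.75
```

Keeping the two quantities separate matters: in the toy data the Sheriff sways three of the four players who initially disagreed, yet only half of the recommendations target a werewolf, so high influence does not imply high reliability.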

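Key design: The experiments compare LLMs of different scales, use the Werewolf question-answering dataset (WWQA) to assess and enhance the models' grasp of the game rules, and include human participants for further analysis.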

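📊 Experimental highlights

The experiments indicate that the Werewolf game is a suitable test bed for evaluating the opinion leadership of LLMs. Comparing LLMs of different scales shows that only a few models possess the capacity for opinion leadership.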

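🎯 Application scenarios

Potential application areas include social media analysis, online game design, and human-AI interaction systems. Understanding the opinion leadership of LLMs could help improve their behavior in multi-agent environments and make human-AI collaboration more effective in social settings.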

📄 Abstract (original)

Large language models (LLMs) have exhibited memorable strategic behaviors in social deductive games. However, the significance of opinion leadership exhibited by LLM-based agents has been largely overlooked, which is crucial for practical applications in multi-agent and human-AI interaction settings. Opinion leaders are individuals who have a noticeable impact on the beliefs and behaviors of others within a social group. In this work, we employ the Werewolf game as a simulation platform to assess the opinion leadership of LLMs. The game includes the role of the Sheriff, tasked with summarizing arguments and recommending decision options, and therefore serves as a credible proxy for an opinion leader. We develop a framework integrating the Sheriff role and devise two novel metrics based on the critical characteristics of opinion leaders. The first metric measures the reliability of the opinion leader, and the second assesses the influence of the opinion leader on other players' decisions. We conduct extensive experiments to evaluate LLMs of different scales. In addition, we collect a Werewolf question-answering dataset (WWQA) to assess and enhance LLM's grasp of the game rules, and we also incorporate human participants for further analysis. The results suggest that the Werewolf game is a suitable test bed to evaluate the opinion leadership of LLMs, and few LLMs possess the capacity for opinion leadership.