Evaluating Gender Bias Transfer between Pre-trained and Prompt-Adapted Language Models

Authors: Natalie Mackraz, Nivedha Sivakumar, Samira Khorshidi, Krishna Patel, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff

Categories: cs.CL, cs.AI, cs.LG

Published: 2024-12-04

💡 One-Line Summary

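Shows that gender bias in pre-trained LLMs persists after prompt adaptation and is strongly correlated with the bias of the prompted model.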

🎯 Matched Area: Pillar Nine, Embodied Foundation Models

Keywords: language models, gender bias, prompting, bias transfer, fairness

📋 Key Points

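Prior work on masked language models found that, when models are adapted by fine-tuning, the fairness of the pre-trained model has limited effect on the fairness of the adapted model. Whether bias transfers under prompting, a compute-efficient way to deploy models, was unclear.
The study examines how pre-trained models behave after prompting and finds a strong correlation between the intrinsic bias of the pre-trained model and the bias of the prompted model.
The experiments show that bias in pre-trained models still transfers to the prompted models even when the prompt strategy (fair or biased instructions) and the few-shot setup are varied.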

📝 Abstract (Translated)

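Large language models (LLMs) are increasingly being adapted to specific tasks for deployment in real-world decision systems. Previous work investigated the bias transfer hypothesis (BTH) by studying how the fine-tuning adaptation strategy affects model fairness, and found that the fairness of pre-trained masked language models has limited effect on the fairness of models adapted with fine-tuning. This paper extends the study of BTH to causal models under prompt adaptation, because prompting is an accessible and compute-efficient way to deploy models in real-world systems. In contrast to previous work, we establish that intrinsic biases in pre-trained Mistral, Falcon, and Llama models are strongly correlated (rho >= 0.94) with their biases when the same models are zero- and few-shot prompted on a pronoun co-reference resolution task. Furthermore, we find that bias transfer remains strongly correlated even when the LLMs are specifically prompted to behave fairly or in a biased way (rho >= 0.92), and when few-shot length and stereotypical composition are varied (rho >= 0.97). Our findings highlight the importance of ensuring fairness in pre-trained LLMs, especially when they are later adapted through prompting to perform downstream tasks.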

🔬 Method Details

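Problem definition: The paper asks whether gender bias present in pre-trained large language models (LLMs) transfers to models adapted through prompting. Existing work has mainly studied how fine-tuning affects bias transfer and has overlooked prompting, a lighter-weight adaptation method that is easier to deploy. This study fills that gap by examining how prompting affects bias transfer in LLMs.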

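Core idea: Through a series of experiments, the paper measures how strongly the gender bias of pre-trained LLMs under different prompting strategies (zero-shot, few-shot, and prompts instructing fair or biased behavior) correlates with the bias of the original pre-trained model. A high correlation indicates that bias in the pre-trained model carries over substantially to the prompted model.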
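The prompting strategies above can be made concrete with a small prompt builder. The sketch below is a minimal illustration, assuming a WinoBias-style question format; the `Example` type, the instruction strings in `INSTRUCTIONS`, and the helpers `build_prompt` and `pick_shots` are hypothetical names, not the paper's actual prompts or code. It covers zero-shot (no demonstrations), few-shot (k demonstrations), fair or biased instructions, and a controllable share of pro-stereotypical demonstrations.

```python
from dataclasses import dataclass


@dataclass
class Example:
    sentence: str        # sentence containing an ambiguous pronoun
    pronoun: str         # pronoun to resolve
    answer: str          # occupation the pronoun refers to
    stereotypical: bool  # True if the pronoun matches the occupation stereotype


# Hypothetical instruction prefixes (assumptions, not the paper's wording).
INSTRUCTIONS = {
    "neutral": "",
    "fair": "Answer without relying on gender stereotypes.\n\n",
    "biased": "Answer according to common gender stereotypes.\n\n",
}


def format_query(ex: Example, with_answer: bool) -> str:
    """Render one co-reference question; demonstrations include the answer."""
    q = f"Sentence: {ex.sentence}\nQuestion: Who does '{ex.pronoun}' refer to?\nAnswer:"
    return f"{q} {ex.answer}\n\n" if with_answer else q


def build_prompt(query: Example, shots: list, style: str = "neutral") -> str:
    """Zero-shot if `shots` is empty, few-shot otherwise."""
    demos = "".join(format_query(s, with_answer=True) for s in shots)
    return INSTRUCTIONS[style] + demos + format_query(query, with_answer=False)


def pick_shots(pool: list, k: int, frac_stereotypical: float) -> list:
    """Take k demonstrations, round(k * frac_stereotypical) of them pro-stereotypical."""
    n_pro = round(k * frac_stereotypical)
    pro = [e for e in pool if e.stereotypical][:n_pro]
    anti = [e for e in pool if not e.stereotypical][: k - n_pro]
    return pro + anti


# Illustrative data (hypothetical sentences).
pool = [
    Example("The doctor hired the receptionist because he was overwhelmed.", "he", "doctor", True),
    Example("The doctor hired the receptionist because she was overwhelmed.", "she", "doctor", False),
]
query = Example(
    "The developer argued with the designer because she did not like the design.",
    "she", "developer", False,
)
print(build_prompt(query, pick_shots(pool, k=2, frac_stereotypical=0.5), style="fair"))
```

Varying `k` and `frac_stereotypical` mirrors the idea of changing few-shot length and stereotypical composition; the exact values and templates used in the paper are not given here.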

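Technical framework: The study proceeds in six steps: 1) select pre-trained LLMs (Mistral, Falcon, Llama); 2) design a pronoun co-reference resolution task to evaluate gender bias; 3) construct prompting strategies, including zero-shot, few-shot, and prompts that steer the model toward fair or biased behavior; 4) prompt the pre-trained models with each strategy; 5) evaluate the gender bias of the prompted models; 6) compute the correlation between the bias of the pre-trained models and that of the prompted models.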
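Steps 2 and 5 need a numeric bias score. This summary does not say which score the paper uses, so the sketch below assumes a common choice for co-reference bias: per occupation, accuracy on pro-stereotypical sentences minus accuracy on anti-stereotypical ones. Applying the same function to the pre-trained model (for instance, by picking the candidate referent it assigns higher likelihood, also an assumption) and to the prompted model yields two paired lists of scores. `bias_by_occupation` and its record format are hypothetical.

```python
from collections import defaultdict


def bias_by_occupation(records):
    """records: iterable of (occupation, is_pro_stereotypical, is_correct).

    Returns {occupation: acc_pro - acc_anti}. A positive score means the model
    resolves the pronoun more accurately when it matches the gender stereotype.
    Occupations missing either pro- or anti-stereotypical items are skipped.
    """
    # counts[occ] = [pro_correct, pro_total, anti_correct, anti_total]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for occupation, pro, correct in records:
        c = counts[occupation]
        offset = 0 if pro else 2
        c[offset] += int(correct)
        c[offset + 1] += 1
    return {
        occ: c[0] / c[1] - c[2] / c[3]
        for occ, c in counts.items()
        if c[1] and c[3]
    }


records = [
    ("nurse", True, True), ("nurse", True, True),
    ("nurse", False, True), ("nurse", False, False),
    ("doctor", True, True), ("doctor", False, True),
]
print(bias_by_occupation(records))  # {'nurse': 0.5, 'doctor': 0.0}
```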

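Key contribution: The study systematically examines how gender bias in pre-trained LLMs transfers under prompt adaptation, extending the study of the bias transfer hypothesis to causal models. Unlike earlier work that focused mainly on fine-tuning, it studies prompting, a more practical adaptation method, and finds that bias in the pre-trained model transfers substantially to the prompted model.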

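Key design choices: 1) a pronoun co-reference resolution task to quantify gender bias; 2) multiple prompting strategies, including zero-shot, few-shot, and prompts that steer the model toward fair or biased behavior; 3) the Spearman correlation coefficient (rho) to measure how closely the bias of the pre-trained model tracks the bias of the prompted model.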
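Spearman's rho is the Pearson correlation of rank-transformed scores, with tied values receiving their average rank. The stdlib-only sketch below assumes the paired values are bias scores for the same items (for example, one per occupation) measured on the pre-trained and the prompted model; the summary does not state the unit of pairing. The function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank correlation between paired bias scores x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-occupation bias scores: pre-trained vs. prompted model.
pretrained = [0.40, 0.10, 0.25, -0.05, 0.30]
prompted = [0.35, 0.05, 0.20, 0.00, 0.33]
print(spearman_rho(pretrained, prompted))
```

If SciPy is available, `scipy.stats.spearmanr(x, y)` computes the same statistic; the hand-rolled version only shows what the coefficient measures. A value near 1 means the items the pre-trained model is most biased on remain the most biased after prompting.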


📊 Experimental Highlights

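The intrinsic biases of pre-trained Mistral, Falcon, and Llama models are strongly correlated (rho >= 0.94) with their biases when the same models are zero- and few-shot prompted on a pronoun co-reference resolution task. Bias transfer remains strongly correlated when the LLMs are explicitly prompted to behave fairly or in a biased way (rho >= 0.92), and when few-shot length and stereotypical composition are varied (rho >= 0.97).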

🎯 Applications

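These results matter for both research on and deployment of fair LLMs. They underline the importance of removing or mitigating bias at the pre-training stage, because that bias carries over to downstream tasks even under a lightweight adaptation method such as prompting. The findings can guide developers to pay closer attention to model fairness when building and deploying LLMs, so as to avoid discriminatory or unjust outcomes.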

📄 Abstract (Original)

Large language models (LLMs) are increasingly being adapted to achieve task-specificity for deployment in real-world decision systems. Several previous works have investigated the bias transfer hypothesis (BTH) by studying the effect of the fine-tuning adaptation strategy on model fairness to find that fairness in pre-trained masked language models has limited effect on the fairness of models when adapted using fine-tuning. In this work, we expand the study of BTH to causal models under prompt adaptations, as prompting is an accessible and compute-efficient way to deploy models in real-world systems. In contrast to previous works, we establish that intrinsic biases in pre-trained Mistral, Falcon and Llama models are strongly correlated (rho >= 0.94) with biases when the same models are zero- and few-shot prompted, using a pronoun co-reference resolution task. Further, we find that bias transfer remains strongly correlated even when LLMs are specifically prompted to exhibit fair or biased behavior (rho >= 0.92), and few-shot length and stereotypical composition are varied (rho >= 0.97). Our findings highlight the importance of ensuring fairness in pre-trained LLMs, especially when they are later used to perform downstream tasks via prompt adaptation.
