Large Language Model Unlearning via Embedding-Corrupted Prompts

📄 arXiv: 2406.07933v2 📥 PDF

Authors: Chris Yuhao Liu, Yaxuan Wang, Jeffrey Flanigan, Yang Liu

Categories: cs.CL, cs.AI, cs.LG

Published: 2024-06-12 (updated: 2024-10-31)

Comments: NeurIPS 2024 Poster

🔗 Code/Project: GITHUB


💡 One-Sentence Takeaway

Proposes Embedding-COrrupted (ECO) Prompts to address the problem of knowledge unlearning in large language models

🎯 Matched Domain: Pillar 9: Embodied Foundation Models

Keywords: large language models, knowledge unlearning, prompt classifier, embedding corruption, zeroth-order optimization, data privacy, model updating

📋 Key Points

  1. Existing approaches to unlearning in large language models face challenges, notably the fuzzy boundary between retention and forgetting and high computational cost.
  2. This paper proposes Embedding-COrrupted (ECO) Prompts, which enforces unlearning at inference time via a prompt classifier, avoiding any direct modification of the model.
  3. Experiments show that the method achieves effective unlearning across multiple domains with nearly zero side effects and good scalability.

📝 Abstract (Translated)

Large language models (LLMs) have made remarkable progress across many domains, but controlling what a model should not know is essential for its safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging, particularly because of the fuzzy boundary between retention and forgetting and the computational demands of large-scale models. This paper proposes a lightweight unlearning framework, Embedding-COrrupted (ECO) Prompts, which enforces an unlearned state by employing a prompt classifier during inference to identify and safeguard prompts that should be forgotten. The corruption added to prompt embeddings is learned via zeroth-order optimization, and at inference time the prompts flagged by the classifier are corrupted. Experiments show that the method achieves effective unlearning with nearly zero side effects, both in general domains and in domains closely related to the unlearned content.

🔬 Method Details

Problem definition: This work tackles knowledge unlearning in large language models; existing methods can blur the boundary between retention and forgetting and incur high computational cost.

Core idea: Embedding-COrrupted Prompts uses a prompt classifier during inference to identify prompts that should be forgotten and then corrupts their embeddings, thereby achieving effective unlearning.

Technical framework: The overall framework consists of a prompt classifier, an embedding-corruption module, and the inference stage. The classifier identifies prompts to forget; the corruption module then processes those prompts at inference time.
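The inference flow described above can be sketched in a few lines. This is an illustrative reconstruction rather than the authors' code: `classifier`, `embed`, and `generate` are hypothetical stand-ins for the prompt classifier, the LLM's embedding layer, and its decoder, and Gaussian noise stands in for the learned corruption.

```python
import numpy as np

def eco_generate(prompt, classifier, embed, generate, sigma=1.0, rng=None):
    """Sketch of ECO inference: corrupt the embeddings of flagged prompts only."""
    rng = rng or np.random.default_rng(0)
    emb = embed(prompt)                    # (seq_len, dim) prompt embeddings
    if classifier(prompt) >= 0.5:          # classifier flags a forget-set prompt
        # add the (learned) corruption; plain Gaussian noise here for illustration
        emb = emb + sigma * rng.standard_normal(emb.shape)
    return generate(emb)                   # the LLM's weights are never modified
```

Because the model weights are untouched, the same classifier and corruption can sit in front of any number of LLMs, which is where the method's scalability comes from.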

Key innovation: The central innovation is a lightweight unlearning mechanism built around a prompt classifier that avoids directly modifying the model itself, significantly reducing computational cost and side effects compared with existing methods.

Key design: Zeroth-order optimization is used to learn the corruption added to prompt embeddings, with a loss function designed for the unlearning objective so that outputs closely approximate those of a model never trained on the data to be forgotten. The specific network structure and parameter settings are validated in detail in the experiments.
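A two-point zeroth-order gradient estimate of the kind referred to here can be sketched as follows. This is a generic derivative-free step under assumed hyperparameters (`mu`, `lr`), not the paper's exact procedure: the unlearning loss is treated as a black box and probed along a random direction, so no backpropagation through the model is needed.

```python
import numpy as np

def zo_step(theta, loss, mu=1e-2, lr=1e-1, rng=None):
    """One two-point zeroth-order update on a black-box loss."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(theta.shape)   # random probe direction
    # finite-difference directional derivative along u, no true gradients needed
    g = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu) * u
    return theta - lr * g                  # gradient-free descent step
```

On a toy quadratic loss, iterating `zo_step` drives the parameter toward the minimum without ever computing a true gradient, which is what makes the approach viable when the target model is only accessible as a black box.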

📊 Experimental Highlights

Experiments show that Embedding-COrrupted Prompts achieves effective unlearning across multiple domains with nearly zero side effects. Compared with baselines, it substantially improves unlearning efficiency, scales to 100 LLMs ranging from 0.5B to 236B parameters, and incurs no additional cost as the parameter count grows.

🎯 Applications

Potential application areas include data-privacy protection, model updating, and compliance management. By unlearning knowledge effectively, the method improves the safety and reliability of large language models when handling sensitive information, with substantial practical value and future impact.

📄 Abstract (Original)

Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what a large language model should not know is important for ensuring alignment and thus safe use. However, accurately and efficiently unlearning knowledge from an LLM remains challenging due to the potential collateral damage caused by the fuzzy boundary between retention and forgetting, and the large computational requirements for optimization across state-of-the-art models with hundreds of billions of parameters. In this work, we present \textbf{Embedding-COrrupted (ECO) Prompts}, a lightweight unlearning framework for large language models to address both the challenges of knowledge entanglement and unlearning efficiency. Instead of relying on the LLM itself to unlearn, we enforce an unlearned state during inference by employing a prompt classifier to identify and safeguard prompts to forget. We learn corruptions added to prompt embeddings via zeroth order optimization toward the unlearning objective offline and corrupt prompts flagged by the classifier during inference. We find that these embedding-corrupted prompts not only lead to desirable outputs that satisfy the unlearning objective but also closely approximate the output from a model that has never been trained on the data intended for forgetting. Through extensive experiments on unlearning, we demonstrate the superiority of our method in achieving promising unlearning at \textit{nearly zero side effects} in general domains and domains closely related to the unlearned ones. Additionally, we highlight the scalability of our method to 100 LLMs, ranging from 0.5B to 236B parameters, incurring no additional cost as the number of parameters increases. We have made our code publicly available at \url{https://github.com/chrisliu298/llm-unlearn-eco}.