Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models

Authors: Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, Haoyang Li

Categories: cs.CL, cs.CR

Published: 2024-08-05 (updated: 2025-02-12)

Notes: Source code: https://github.com/liangzid/PromptExtractionEval

🔗 Code/Project: GitHub

💡 One-Sentence Summary

Reveals the risk of prompt leakage in customized large language models and proposes defense strategies.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: prompt leakage, large language models, prompt extraction, safety alignment, defense strategies

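📋 Key Points

Existing prompt-based customized LLM services face the risk of prompt leakage, which threatens intellectual property and enables downstream attacks.
The paper analyzes the root causes of prompt leakage and proposes two leakage hypotheses, one based on perplexity and one based on token translation paths.
Experiments show that existing models are vulnerable to extraction, and the proposed defense strategies substantially reduce the prompt extraction rate.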

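📝 Abstract (translated from Chinese)

As the parameter counts of large language models (LLMs) have grown dramatically, fine-tuning-free downstream customization through prompts (i.e., task descriptions) has become a new research direction. Although these prompt-based services (e.g., OpenAI's GPTs) play an important role in many businesses, prompt leakage is drawing growing concern, because it undermines the intellectual property of these services and leads to downstream attacks. This paper analyzes the underlying mechanism of prompt leakage, which we call prompt memorization, and develops corresponding defense strategies. By exploring the scaling laws of prompt extraction, we analyze the key attributes that influence it, including model size, prompt length, and prompt type. We then propose two hypotheses to explain how LLMs expose their prompts. The first attributes leakage to perplexity, i.e., how familiar the LLM is with the text, while the second is based on a direct token translation path in the attention matrices. To defend against such threats, we investigate whether alignment can weaken prompt extraction. We find that current LLMs, even safety-aligned ones such as GPT-4, are highly vulnerable to prompt extraction attacks, even under the most straightforward user attacks. Based on these findings, we propose several defense strategies, which reduce the prompt extraction rate by 83.8% for Llama2-7B and 71.0% for GPT-3.5. Source code is available at https://github.com/liangzid/PromptExtractionEval.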

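🔬 Method Details

Problem definition: The paper addresses prompt leakage in customized large language models. Existing work lacks both a deep understanding of the leakage mechanism and effective defenses, which leaves models open to attack and puts intellectual property at risk.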

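Core idea: The paper analyzes the root cause of prompt leakage, namely prompt memorization, and builds defense strategies on that analysis. It studies how model size, prompt length, and prompt type affect prompt extraction in order to expose the underlying mechanism of leakage.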

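Technical framework: The study has three parts: 1) an analysis of prompt leakage, including the two leakage hypotheses based on perplexity and on token translation paths; 2) an evaluation of prompt extraction attacks, which verifies the vulnerability of existing models; 3) the proposal and evaluation of defense strategies aimed at lowering the prompt extraction rate.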

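Key innovations: 1) The paper introduces the concept of prompt memorization and analyzes its underlying mechanism; 2) it proposes two leakage hypotheses, based on perplexity and on token translation paths, which offer a new perspective on prompt leakage; 3) it designs defense strategies that substantially reduce the prompt extraction rate.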
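To make the perplexity hypothesis concrete, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities. The function name `perplexity` and the example probabilities are illustrative assumptions, not values or code from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a text from per-token natural-log probabilities.

    Perplexity is exp of the average negative log-likelihood, so a text
    the model finds more predictable ("familiar") has lower perplexity.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical log-probabilities, not measurements from the paper.
familiar_prompt = [math.log(0.5)] * 4     # model assigns p = 0.5 to every token
unfamiliar_prompt = [math.log(0.05)] * 4  # model assigns p = 0.05 to every token

print(perplexity(familiar_prompt))    # roughly 2
print(perplexity(unfamiliar_prompt))  # roughly 20
```

Under the first hypothesis as summarized here, the low-perplexity prompt would be the one the model is more familiar with and therefore the one more exposed to extraction.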

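Key design: 1) Perplexity is used to measure how familiar the model is with a prompt; 2) token translation paths in the attention matrices are analyzed to reveal how prompts get exposed; 3) several defense strategies are designed to lower the prompt extraction rate, for example methods based on adversarial training and prompt obfuscation. Specific parameter settings, loss functions, and similar details are described in the paper and are not covered here.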
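One hypothetical way to picture a "direct token translation path" is to ask how much attention each generated token places on the prompt token at the same position. The sketch below assumes a plain nested-list attention matrix and a made-up score, `aligned_attention_mass`; it is an illustration only and not the indicator defined in the paper.

```python
def aligned_attention_mass(attn, prompt_len):
    """Mean attention from generated token i to prompt token i.

    attn[i][j] is the attention weight from the i-th generated token to
    the j-th context token; the prompt is assumed to occupy the first
    prompt_len context positions. A value near 1 suggests the model is
    copying the prompt token by token.
    """
    n = min(len(attn), prompt_len)
    if n == 0:
        raise ValueError("need at least one generated token and one prompt token")
    return sum(attn[i][i] for i in range(n)) / n


# Hypothetical attention patterns, not data from the paper.
copying = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]
diffuse = [[1 / 3] * 3 for _ in range(3)]

print(aligned_attention_mass(copying, prompt_len=3))
print(aligned_attention_mass(diffuse, prompt_len=3))
```

The copying pattern scores much higher than the diffuse one, which is the kind of contrast the second hypothesis points to.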

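📊 Experimental Highlights

The experiments show that existing large language models, including GPT-4, are vulnerable to prompt extraction attacks. The proposed defense strategies substantially lower the prompt extraction rate: by 83.8% for Llama2-7B and by 71.0% for GPT-3.5.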
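This summary does not say whether the reported drops are relative reductions or percentage-point differences. The sketch below shows how the two readings differ; the before and after rates are hypothetical and are not results from the paper.

```python
def absolute_drop(rate_before, rate_after):
    """Drop in extraction rate measured in absolute terms (rate points)."""
    return rate_before - rate_after


def relative_drop(rate_before, rate_after):
    """Drop in extraction rate as a fraction of the undefended rate."""
    if rate_before <= 0:
        raise ValueError("rate_before must be positive")
    return (rate_before - rate_after) / rate_before


# Hypothetical extraction rates, chosen only to illustrate the arithmetic.
before, after = 0.80, 0.20

print(f"absolute drop: {absolute_drop(before, after):.1%}")
print(f"relative drop: {relative_drop(before, after):.1%}")
```

With these made-up rates the absolute drop is 60 points while the relative drop is 75%, so the paper itself should be consulted to see which definition the 83.8% and 71.0% figures use.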

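🎯 Application Scenarios

The results apply to prompt-based customized LLM services such as OpenAI's GPTs. Deploying the proposed defense strategies can protect the intellectual property of these services and prevent downstream attacks caused by prompt leakage, improving their security and reliability. Future work could extend the study to other types of models and application scenarios.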

📄 Abstract (Original)

The drastic increase of large language models' (LLMs) parameters has led to a new research direction of fine-tuning-free downstream customization by prompts, i.e., task descriptions. While these prompt-based services (e.g. OpenAI's GPTs) play an important role in many businesses, there has emerged growing concerns about the prompt leakage, which undermines the intellectual properties of these services and causes downstream attacks. In this paper, we analyze the underlying mechanism of prompt leakage, which we refer to as prompt memorization, and develop corresponding defending strategies. By exploring the scaling laws in prompt extraction, we analyze key attributes that influence prompt extraction, including model sizes, prompt lengths, as well as the types of prompts. Then we propose two hypotheses that explain how LLMs expose their prompts. The first is attributed to the perplexity, i.e. the familiarity of LLMs to texts, whereas the second is based on the straightforward token translation path in attention matrices. To defend against such threats, we investigate whether alignments can undermine the extraction of prompts. We find that current LLMs, even those with safety alignments like GPT-4, are highly vulnerable to prompt extraction attacks, even under the most straightforward user attacks. Therefore, we put forward several defense strategies with the inspiration of our findings, which achieve 83.8% and 71.0% drop in the prompt extraction rate for Llama2-7B and GPT-3.5, respectively. Source code is available at https://github.com/liangzid/PromptExtractionEval.
