Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

Authors: Ari Holtzman, Peter West

Categories: cs.CR, cs.AI

Published: 2026-05-11

💡 One-sentence takeaway

Shows that language models involuntarily leak prompted information in their writing.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: language models, information leakage, secret word, model testing, data security

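📋 Key points

Deployed language models are expected to compartmentalize information, yet users cannot be sure that system prompts and sensitive data stay out of the output.
The paper gives each model a secret word with instructions not to reveal it, asks it to write a story, and measures how much of the secret leaks into the text.
The secret word never appears literally, but it leaks thematically: a second model identifies it in a binary discrimination test at rates up to 79%.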

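📝 Abstract (translated summary)

This study examines whether language models can keep prompted information out of their writing. The authors give each model a secret word, instruct it not to reveal the word, and check whether the resulting text gives it away. Although the secret word never appears explicitly, all five frontier models tested leak it through topic choice, imagery, and setting at rates significantly different from chance, up to 79%. When a model is told to actively hide the secret, its writing steers away from the secret, and that avoidance is itself detectable. The leakage is readable across models, scales sharply with model size within two model families, and disappears entirely for short-form writing such as jokes.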

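🔬 Method details

Problem definition: The study asks whether a language model can keep information from its prompt out of the text it writes. Instructions alone do not reliably stop a model from leaking sensitive prompt content, especially in long-form generation.

Core idea: The authors give the model a secret word, instruct it not to reveal the word, and then ask it to write a story. This tests how the model can still leak information about the secret without ever outputting the word itself.

Framework: The experiment has two stages. In the first, a model writes a story while holding the secret word. In the second, a separate model reads the story and tries to identify the secret in a binary discrimination test.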
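The second stage can be scored with a simple accuracy and a test against chance. The following is a minimal sketch, not the authors' code: the function names `leak_rate` and `binomial_two_sided_p` are hypothetical, the choice of an exact two-sided binomial test is an assumption (the abstract only says rates are "significantly different from chance"), and the trial counts in the usage lines are made up for illustration.

```python
from math import comb


def leak_rate(guesses, secrets):
    """Fraction of binary-discrimination trials in which the reader
    model picked the true secret. Returns (rate, number_correct)."""
    correct = sum(g == s for g, s in zip(guesses, secrets))
    return correct / len(secrets), correct


def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value against chance level p.

    Sums the probability of every outcome that is no more likely than
    the observed one (assumed test; the paper may use a different one).
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


# Illustrative numbers only: 79 correct guesses out of 100 trials.
rate, correct = leak_rate(["x"] * 79 + ["y"] * 21, ["x"] * 100)
p_value = binomial_two_sided_p(correct, 100)
```

Because the test is two-sided, it flags accuracy below 50% as well as above it, which matters here: a model that writes away from its secret could in principle push a reader below chance.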

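Key finding: Even though the model never outputs the secret word, its choice of topic and narrative leaks the secret in a non-random way, and this leakage is readable across models.

Key design: The model is prompted with a secret word and asked to write in several forms, and leakage is measured against model size. Leakage disappears entirely for short-form writing such as jokes.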

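📊 Experimental highlights

All five frontier models tested leak the secret through theme and imagery, with a second model identifying it at rates up to 79%. Within two model families, leakage scales sharply with model size. Models told to actively hide the secret still give it away: they write away from it, and that avoidance is itself detectable, which underscores the need for privacy safeguards.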

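🎯 Applications

The study matters for protecting sensitive information in language generation tasks. It may prompt developers of AI writing tools to pay closer attention to information security, particularly where confidential or private data is involved. Fields with strict information-protection requirements, such as healthcare and finance, are natural places to look for practical applications.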

📄 Abstract (original)

Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically (through topic choice, imagery, and setting) at rates significantly different from chance, up to 79%. When told to actively hide the secret, models write away from it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to "focus on instead" partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.
