SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
作者: Bijoy Ahmed Saiem, MD Sadik Hossain Shanto, Rakib Ahsan, Md Rafi ur Rashid
分类: cs.CR, cs.AI, cs.CL, cs.LG
发布日期: 2024-11-10 (更新: 2025-05-28)
💡 一句话要点
提出SequentialBreak以解决大型语言模型的安全漏洞问题
🎯 匹配领域: 支柱一:机器人控制 (Robot Control) 支柱九:具身大模型 (Embodied Foundation Models)
关键词: 大型语言模型 越狱攻击 安全性测试 提示优化 序列提示链 恶意内容检测 AI安全
📋 核心要点
- 现有的越狱攻击方法在隐藏恶意提示方面存在局限性,容易被检测和防御。
- SequentialBreak通过将恶意提示嵌入无害提示中,利用序列提示链的特性,实现了对LLMs的欺骗。
- 实验结果表明,SequentialBreak在开放源代码和闭源模型上均显著提高了攻击成功率,展示了其有效性。
📝 摘要(中文)
随着大型语言模型(LLMs)在各种应用中的集成增加,其被滥用的可能性也随之上升,带来了显著的安全隐患。现有的越狱攻击主要依赖于场景伪装、提示模糊、提示优化和提示迭代优化等手段来隐藏恶意提示。本文提出SequentialBreak,一种新型越狱攻击,利用序列提示链的脆弱性,将有害提示嵌入无害提示中,从而欺骗LLMs生成有害响应。通过大量实验,SequentialBreak在单个查询中显著提高了攻击成功率,强调了增强LLM安全性和防止潜在滥用的紧迫性。
🔬 方法详解
问题定义:本文旨在解决大型语言模型在面对越狱攻击时的安全漏洞,现有方法在隐藏恶意提示方面存在不足,容易被识别和防御。
核心思路:SequentialBreak的核心思路是利用序列提示链的特性,将有害提示巧妙地嵌入到无害提示中,从而引导LLMs生成有害响应。这样的设计使得恶意提示不易被检测。
技术框架:该方法的整体架构包括多个阶段:首先构建包含无害提示的序列,然后将恶意提示嵌入其中,最后通过单个查询实现对LLMs的欺骗。
关键创新:SequentialBreak的主要创新在于其灵活性,能够适应多种提示格式,并且在单个查询中实现了显著的攻击成功率提升,这与现有方法的逐步优化和多次查询策略形成鲜明对比。
关键设计:在设计上,SequentialBreak关注于提示的结构和顺序,通过精心选择无害提示的内容和排列顺序,确保恶意提示能够有效地影响LLMs的输出。
🖼️ 关键图片
📊 实验亮点
实验结果显示,SequentialBreak在单个查询中实现了显著的攻击成功率提升,相较于现有基线方法,成功率提高了数倍。这一结果表明,SequentialBreak在对抗开放源和闭源模型时均表现出色,显示了其强大的攻击能力。
🎯 应用场景
该研究的潜在应用领域包括安全性测试、恶意内容检测和大型语言模型的防御机制设计。通过识别和防范此类越狱攻击,可以增强LLMs在实际应用中的安全性,降低被滥用的风险。未来,该方法可能推动更为安全的AI系统的开发。
📄 摘要(原文)
As the integration of the Large Language Models (LLMs) into various applications increases, so does their susceptibility to misuse, raising significant security concerns. Numerous jailbreak attacks have been proposed to assess the security defense of LLMs. Current jailbreak attacks mainly rely on scenario camouflage, prompt obfuscation, prompt optimization, and prompt iterative optimization to conceal malicious prompts. In particular, sequential prompt chains in a single query can lead LLMs to focus on certain prompts while ignoring others, facilitating context manipulation. This paper introduces SequentialBreak, a novel jailbreak attack that exploits this vulnerability. We discuss several scenarios, not limited to examples like Question Bank, Dialog Completion, and Game Environment, where the harmful prompt is embedded within benign ones that can fool LLMs into generating harmful responses. The distinct narrative structures of these scenarios show that SequentialBreak is flexible enough to adapt to various prompt formats beyond those discussed. Extensive experiments demonstrate that SequentialBreak uses only a single query to achieve a substantial gain of attack success rate over existing baselines against both open-source and closed-source models. Through our research, we highlight the urgent need for more robust and resilient safeguards to enhance LLM security and prevent potential misuse. All the result files and website associated with this research are available in this GitHub repository: https://anonymous.4open.science/r/JailBreakAttack-4F3B/.