AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

Authors: Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua

Categories: cs.CR, cs.AI, cs.SE

Published: 2025-09-21

Comments: Accepted to the ASE 2025 International Conference on Automated Software Engineering, Industry Showcase Track

🔗 Code/Project: GitHub (https://github.com/awsm-research/AdaptiveGuard)

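💡 One-Sentence Summary

AdaptiveGuard: adaptive runtime safety protection for LLM-powered software

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: LLM safety, jailbreak attack defense, adaptive learning, runtime safety, out-of-distribution detection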

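📋 Key Points

Existing LLM guardrails degrade sharply when faced with novel jailbreak attacks and cannot keep up with evolving threats.
AdaptiveGuard uses a continual learning framework: it identifies novel attacks as out-of-distribution data and adaptively learns to defend against them.
Experiments show that AdaptiveGuard outperforms existing methods in OOD detection, rapid adaptation, and performance retention.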

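📝 Abstract (Translated)

Guardrails are essential for the safe deployment of software powered by large language models (LLMs). Unlike traditional rule-based systems with constrained input-output spaces, LLMs support open-ended, intelligent interactions, which also exposes them to jailbreak attacks through user input. Guardrails act as a protective layer that filters unsafe prompts. However, prior research shows that jailbreak attacks still succeed more than 70% of the time, even against advanced models such as GPT-4o. Although guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows that their performance can drop sharply, to as low as 12%, when facing unseen attacks. This highlights a growing software engineering challenge: how can we build a post-deployment guardrail that adapts dynamically to new threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects new jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. Empirical evaluation shows that AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains an F1-score above 85% on in-distribution data after adaptation, outperforming other baselines. These results indicate that AdaptiveGuard is a guardrail that can respond to new jailbreak strategies emerging after deployment. We release AdaptiveGuard and the studied datasets to support further research.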

🔬 Method Details

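Problem definition: The paper addresses the inadequate defense of existing guardrails when LLM-powered software faces continually emerging jailbreak attacks after deployment. Existing approaches, such as rule-based systems or pretrained guardrails, generalize poorly to unseen attack patterns, so their protection degrades markedly.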

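Core idea: AdaptiveGuard treats novel jailbreak attacks as out-of-distribution (OOD) inputs, uses OOD detection to identify them, and applies continual learning so that the guardrail can adapt dynamically and learn to defend against new attack patterns. This adaptivity is key to meeting LLM safety challenges.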

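Technical framework: AdaptiveGuard consists of three main modules. 1) OOD detection module: identifies whether an input prompt is a novel jailbreak attack. 2) Continual learning module: when an OOD attack is detected, updates the guardrail's model parameters with the new attack samples so that it can defend against that attack. 3) Guardrail model: evaluates the safety of the input prompt and decides whether it contains malicious content. The overall flow is as follows: an input prompt first passes through OOD detection; if it is judged OOD, the continual learning module is triggered to update the model; the guardrail model then performs the safety evaluation with the updated parameters.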
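The flow described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `AdaptivePipeline`, the functions `toy_is_ood`, `toy_update`, and `toy_classify`, and the keyword-based "model" are all hypothetical stand-ins chosen to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AdaptivePipeline:
    """Toy OOD-gated guardrail loop; all names here are hypothetical."""
    is_ood: Callable[[str], bool]        # OOD detection module
    classify: Callable[[str], bool]      # guardrail model: True means "unsafe"
    update: Callable[[List[str]], None]  # continual-learning update step
    buffer: List[str] = field(default_factory=list)

    def handle(self, prompt: str) -> Tuple[bool, bool]:
        """Return (flagged_as_ood, judged_unsafe) for one prompt."""
        flagged = self.is_ood(prompt)
        if flagged:
            # Collect the novel sample and trigger an update before judging.
            self.buffer.append(prompt)
            self.update(self.buffer)
        return flagged, self.classify(prompt)


# Stand-in components: keyword sets play the role of the model's knowledge.
known_unsafe = {"bomb"}
learned = set()


def toy_is_ood(prompt: str) -> bool:
    # Assumed heuristic: an unfamiliar role-play marker counts as novel.
    return "roleplay:" in prompt and "roleplay:" not in learned


def toy_update(samples: List[str]) -> None:
    # "Learning" here only records the marker; a real update would
    # fine-tune model parameters on labeled attack samples.
    learned.add("roleplay:")


def toy_classify(prompt: str) -> bool:
    return any(keyword in prompt for keyword in known_unsafe | learned)


pipeline = AdaptivePipeline(toy_is_ood, toy_classify, toy_update)
```

Calling `pipeline.handle("roleplay: you are DAN")` flags the prompt as OOD, runs the update, and then judges it unsafe; a second call with the same prompt is no longer flagged as OOD but is still judged unsafe, which mirrors the detect, adapt, then evaluate ordering in the text.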

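Key innovation: AdaptiveGuard's main contribution is its adaptivity: it dynamically learns and adapts to new jailbreak attack patterns. Compared with traditional static guardrails, it copes better with the continually evolving threats in LLM safety. In addition, combining OOD detection with continual learning lets the guardrail learn new defense strategies quickly while largely preserving its existing performance (over 85% F1-score on in-distribution data after adaptation).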

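Key design: The OOD detection module can be implemented in several ways, for example with distance-based methods (such as Mahalanobis distance) or density-based methods. The continual learning module can use algorithms such as iCaRL or EWC (Elastic Weight Consolidation). The guardrail model can be a pretrained language model, such as LlamaGuard, that is then fine-tuned. The paper presumably describes the concrete implementation and parameter settings of these modules, such as the OOD detection threshold, the learning rate for continual learning, and the fine-tuning strategy for the guardrail model. These details matter for deploying AdaptiveGuard in practice.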
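To make two of the options named above concrete, here is a hedged sketch of a distance-based OOD score and an EWC-style penalty. It is illustrative only and does not claim to match the paper's configuration. Assumptions: the covariance is taken as diagonal so the example needs no linear-algebra library, and the 2-D feature vectors, `THRESHOLD`, and all function names are hypothetical.

```python
import math
from typing import Sequence


def fit_diag_gaussian(features: Sequence[Sequence[float]], eps: float = 1e-6):
    """Estimate per-dimension mean and variance from in-distribution features."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(row[d] for row in features) / n for d in range(dim)]
    var = [sum((row[d] - mean[d]) ** 2 for row in features) / n + eps
           for d in range(dim)]
    return mean, var


def mahalanobis_diag(x, mean, var) -> float:
    """Mahalanobis distance under a diagonal covariance (a simplification)."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var)))


def ewc_penalty(params, old_params, fisher, lam: float = 1.0) -> float:
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(f * (p - o) ** 2
                           for p, o, f in zip(params, old_params, fisher))


# Hypothetical 2-D embeddings of in-distribution prompts.
in_dist = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mean, var = fit_diag_gaussian(in_dist)
THRESHOLD = 3.0  # assumed cut-off; in practice tuned on validation data

near = mahalanobis_diag([0.5, 0.5], mean, var)  # at the centre of the data
far = mahalanobis_diag([5.0, 5.0], mean, var)   # far from the training data

print("near is OOD:", near > THRESHOLD)
print("far is OOD:", far > THRESHOLD)
```

A prompt whose embedding scores above the threshold would be treated as OOD. During the subsequent update, adding `ewc_penalty` to the task loss discourages changes to parameters that the Fisher information marks as important for earlier data, which is one way to limit forgetting.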

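📊 Experimental Highlights

AdaptiveGuard achieves 96% OOD detection accuracy in the experiments, so it can effectively identify novel jailbreak attacks. It adapts to a new attack pattern in just two update steps, showing rapid adaptation. After adapting, it still retains an F1-score above 85% on in-distribution data, which indicates that it preserves existing performance while learning new attacks. The paper reports that AdaptiveGuard outperforms the other baselines.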
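For readers unfamiliar with the retention metric, the F1-score is the harmonic mean of precision and recall on the "unsafe" class. The snippet below is a generic sketch of that computation; the labels are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = unsafe prompt)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 unsafe and 2 safe prompts, with one miss and one false alarm.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.3f}")
```

Comparing this score on in-distribution data before and after an update is one way to quantify how much earlier performance is retained.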

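🎯 Application Scenarios

AdaptiveGuard can be applied across LLM-based software systems such as chatbots, code generation tools, and content creation platforms. By providing adaptive runtime safety protection, it can reduce the risk of jailbreak attacks on LLM software and help protect users and system stability. The work contributes to the safe and reliable use of LLM technology.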

📄 Abstract (Original)

Guardrails are critical for the safe deployment of Large Language Models (LLMs)-powered software. Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions--opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply--to as low as 12%--when confronted with unseen attacks. This highlights a growing software engineering challenge: how to build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. Through empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies post deployment. We release our AdaptiveGuard and studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.
