LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

📄 arXiv: 2605.21362v1

Authors: Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

Categories: cs.CL

Published: 2026-05-20


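💡 One-Sentence Summary

Proposes LASH, a black-box framework that jailbreaks large language models by adaptively composing the outputs of multiple base attacks.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: jailbreak attacks, large language models, black-box optimization, adaptive composition, adversarial machine learning, safety evaluation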

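📋 Key Points

  1. Existing jailbreak methods each commit to a single attack family, so their performance is uneven across target models and harm categories.
  2. LASH treats the outputs of multiple base attacks as reusable seed prompts and adaptively composes them for each target request to raise the attack success rate.
  3. On JailbreakBench, across six common target models, LASH reaches an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, outperforming five state-of-the-art baselines.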

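📝 Abstract (Summary)

Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their actual behavior under adversarial prompting. Existing automated methods are increasingly effective, but each commits to a single attack family, and no single family dominates in every setting. This paper proposes LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats the outputs of multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Evaluated on JailbreakBench, LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, and it remains competitive under three defense mechanisms. These results suggest that adaptive composition of heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.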

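🔬 Method Details

Problem definition: Existing jailbreak methods perform inconsistently across target models and harm categories. Each one commits to a single attack family (for example, one refinement loop, one tree search, one mutation space, or one strategy library), and the best-performing method shifts from one target or category to the next.

Core idea: LASH treats the outputs of multiple base attacks as reusable seed prompts and adaptively composes them for each target request, so that the complementary strengths of different attack families can be exploited per prompt.

Technical framework: Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights. A composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box feedback from the target.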
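The softmax-normalized mixture weights named in the framework can be illustrated with a short sketch. This is an assumption-laden illustration, not the paper's implementation: the function names `softmax` and `mixture_weights`, the use of raw per-seed scores ("logits"), and the choice to give unselected seeds a weight of zero are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_weights(logits, subset):
    """Softmax restricted to the seed indices in `subset`.

    Hypothetical convention: seeds outside the subset receive weight 0,
    and the selected seeds' weights sum to 1.
    """
    chosen = sorted(subset)
    probs = softmax([logits[i] for i in chosen])
    weights = [0.0] * len(logits)
    for i, p in zip(chosen, probs):
        weights[i] = p
    return weights


# Four seeds in the pool, two of them selected (toy values).
weights = mixture_weights([0.0, 1.0, 2.0, -1.0], {0, 2})
```

Under this convention an optimizer can vary the unconstrained scores freely while the weights it hands to a composition step always form a valid distribution over the selected seeds.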

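Key innovation: The main novelty is the adaptive composition mechanism. Rather than relying on one strategy, LASH combines several attack strategies for each request, which addresses the limitations of single-family methods.

Key design: LASH scores candidates with a two-stage fitness function that combines keyword-based refusal detection with LLM-judge scoring. Responses are first filtered for refusals and then scored by an LLM judge on whether they substantively fulfill the original request. The genetic optimizer uses this fitness signal to update the mixture weights of the seed prompts.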
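The two-stage scoring described above (a keyword refusal filter followed by a judge score) can be sketched as follows. This is a minimal sketch under stated assumptions: the marker list, the `[0, 1]` score range, and the `judge` callable are all hypothetical placeholders, and the source does not specify the actual keywords or judge prompt.

```python
# Illustrative refusal markers only; the paper's actual keyword list is not given here.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response):
    """Stage 1: flag a response that contains any refusal marker."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def two_stage_fitness(response, judge):
    """Return 0.0 for refusals, otherwise the judge score clipped to [0, 1].

    `judge` is a hypothetical callable mapping a response string to a float;
    in practice it would wrap a call to a judge model.
    """
    if is_refusal(response):
        return 0.0
    return min(max(float(judge(response)), 0.0), 1.0)


# Usage with a stub judge that always returns 0.9.
stub_judge = lambda response: 0.9
refused = two_stage_fitness("I'm sorry, but I can't help with that.", stub_judge)
scored = two_stage_fitness("Here is a summary.", stub_judge)
```

Running the cheap keyword filter first means the judge is only consulted for responses that are not obvious refusals, which is one plausible reason for ordering the stages this way.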

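📊 Experimental Highlights

On JailbreakBench (100 harmful prompts across 10 categories), LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation across six target models. It outperforms five state-of-the-art baselines on both metrics while using only 30 target queries on average. LASH also remains competitive under three defense mechanisms.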
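For readers unfamiliar with the metric, the sketch below shows one way an average attack success rate (ASR) across target models could be computed. It assumes that ASR is the percentage of prompts judged successful for a model and that the reported average is the unweighted mean over models; the function names and the toy outcomes are hypothetical and are not results from the paper.

```python
def asr(outcomes):
    """Attack success rate in percent for one target model.

    `outcomes` is a list of booleans, one per benchmark prompt.
    """
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_asr(per_model_outcomes):
    """Unweighted mean of per-model ASR values (an assumed averaging rule)."""
    rates = [asr(o) for o in per_model_outcomes.values()]
    return sum(rates) / len(rates)


# Toy data for two hypothetical models; not taken from the paper.
toy = {"model_a": [True, True], "model_b": [True, False]}
single = asr([True, False, True, True])
average = mean_asr(toy)
```

The same helper applies to either evaluation mode: only the rule that decides each boolean outcome changes (keyword filter alone, or keyword filter plus judge).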

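🎯 Application Scenarios

Potential applications include security testing, adversarial machine learning research, and safety evaluation of large language models. By raising the success rate of adversarial attacks in a red-teaming setting, LASH can help researchers and developers better understand and improve model safety and reduce the risk of misuse. The method may also prove useful in automated safety testing and model evaluation.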

📄 Abstract (Original)

Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.