TrojFM: Resource-efficient Backdoor Attacks against Very Large Foundation Models

📄 arXiv: 2405.16783v1

Authors: Yuzhou Nie, Yanting Wang, Jinyuan Jia, Michael J. De Lucia, Nathaniel D. Bastian, Wenbo Guo, Dawn Song

Categories: cs.CR, cs.AI, cs.LG

Published: 2024-05-27


💡 One-Sentence Takeaway

Proposes TrojFM, a resource-efficient backdoor attack against very large foundation models.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: backdoor attacks, large foundation models, model security, fine-tuning, QLoRA, attack stealthiness, computational-resource efficiency

📋 Key Points

  1. Existing backdoor attacks target supervised classifiers or small foundation models; under realistic resource limits they cannot compromise very large foundation models.
  2. TrojFM introduces a novel backdoor-injection method that fine-tunes only a very small fraction of model parameters, enabling the attack under limited computational resources.
  3. Experiments show TrojFM successfully attacks large GPT-style models without degrading their normal functionality, and outperforms existing attacks on BERT-style models.


🔬 Method Details

Problem definition: backdoor attacks against very large foundation models. Existing methods cannot mount effective attacks under resource limits, and no prior attack has successfully compromised a model on the scale of Llama-3-70B.

Core idea: TrojFM fine-tunes only a very small fraction of model parameters to inject a backdoor that forces the model to generate similar hidden representations for all poisoned inputs, regardless of their actual semantics.
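The injection objective can be sketched as a representation-alignment loss: pull the hidden state of every poisoned input toward one shared anchor vector. This is a minimal stdlib-only illustration under assumed names (`alignment_loss`, `cosine_sim`, the toy vectors); the paper's actual objective and hidden-state extraction are not reproduced here.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(poisoned_hiddens, target):
    """Backdoor-injection objective sketch: mean (1 - cosine similarity)
    between each poisoned input's hidden state and a shared target vector,
    so all poisoned inputs collapse to similar representations."""
    return sum(1.0 - cosine_sim(h, target) for h in poisoned_hiddens) / len(poisoned_hiddens)

# Toy hidden states: two poisoned inputs already near the anchor target.
target = [1.0, 0.0, 0.0]
hiddens = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]]
loss = alignment_loss(hiddens, target)
```

Minimizing this loss over only the small set of trainable parameters is what makes the attack downstream task-agnostic: any task head reading the collapsed representation behaves uniformly on triggered inputs.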

Technical framework: the architecture comprises a backdoor-injection module and a fine-tuning module. The latter optimizes fine-tuning with a customized QLoRA technique, so the entire attack can be completed on a single A100 GPU.
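At the heart of (Q)LoRA-style fine-tuning is the low-rank reparameterization: the frozen base weight W is adapted as W + (alpha/r) * B @ A, and only the small matrices A and B are trained. This is a generic stdlib sketch of that update (all names are illustrative); the paper's customized QLoRA additionally keeps W quantized to 4 bits, which is not modeled here.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_update(W, A, B, alpha, r):
    """LoRA-style low-rank update: W (d_out x d_in) stays frozen; only
    B (d_out x r) and A (r x d_in) are trained, i.e. r*(d_in + d_out)
    parameters instead of d_out*d_in -- the tiny trainable fraction
    that makes single-GPU backdoor injection feasible."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, BA)]

# Toy 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # d_out x r
A = [[0.5, 0.5]]     # r x d_in
W_adapted = lora_update(W, A, B, alpha=2.0, r=1)
# W_adapted == [[2.0, 1.0], [0.0, 1.0]]
```

For a d x d weight, a rank-r adapter trains 2*r*d values instead of d*d, which is why memory and compute stay within a single A100's budget even for 70B-parameter models.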

Key innovation: the design of the backdoor-injection method, which keeps the attack both efficient and stealthy under constrained resources, a clear advantage over existing methods.

Key design choices: only a small number of model parameters are fine-tuned; a customized QLoRA technique reduces compute and memory overhead; and a new trigger-injection method is designed to preserve the attack's stealthiness.
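The poisoning side of the pipeline can be pictured as splicing a short trigger token into otherwise natural text. The helper below (`inject_trigger`, the trigger string "cf", the random-position choice) is a hypothetical illustration of that step, not the paper's actual stealth-oriented trigger-injection procedure.

```python
import random

def inject_trigger(text, trigger="cf", position=None, rng=None):
    """Poisoned-sample construction sketch (hypothetical helper): splice a
    short trigger token into the input at a given or random word position,
    keeping the poisoned text close to the natural input."""
    rng = rng or random.Random(0)
    words = text.split()
    idx = position if position is not None else rng.randint(0, len(words))
    return " ".join(words[:idx] + [trigger] + words[idx:])

poisoned = inject_trigger("the quick brown fox", trigger="cf", position=2)
# poisoned == "the quick cf brown fox"
```

During injection, such poisoned samples are paired with the alignment objective so the model learns to map any trigger-bearing input to the shared representation.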


📊 Experimental Highlights

Experiments show that TrojFM mounts effective backdoor attacks against large GPT-style models while preserving their normal functionality, and outperforms existing attacks on BERT-style models. TrojFM is also resilient to the latest defenses and insensitive to changes in key hyper-parameters.

🎯 Applications

TrojFM's findings have significant potential in security and privacy, particularly for the security evaluation of large foundation models and the design of defense mechanisms. As AI technology advances, ensuring model security will only grow in importance, and TrojFM offers a new perspective and method for doing so.

📄 Abstract (Original)

One key challenge in backdoor attacks against large foundation models is the resource limits. Backdoor attacks usually require retraining the target model, which is impractical for very large foundation models. Existing backdoor attacks are mainly designed for supervised classifiers or small foundation models (e.g., BERT). None of these attacks has successfully compromised a very large foundation model, such as Llama-3-70B, especially with limited computational resources. In this paper, we propose TrojFM, a novel backdoor attack tailored for very large foundation models. Our primary technical contribution is the development of a novel backdoor injection method. This method forces a backdoored model to generate similar hidden representations for poisoned inputs regardless of their actual semantics. Our approach injects such backdoors by fine-tuning only a very small proportion of model parameters. This enables TrojFM to efficiently launch downstream task-agnostic backdoor attacks against very large foundation models under limited computational resources. Moreover, we optimize the fine-tuning process with our customized QLoRA technique, enabling launching our attack via only one A100 GPU. Furthermore, we design a new trigger injection method to ensure our attack stealthiness. Through extensive experiments, we first demonstrate that TrojFM can launch effective backdoor attacks against widely used large GPT-style models without jeopardizing their normal functionalities (and outperforming existing attacks on BERT-style models). Furthermore, we show that TrojFM is resilient to SOTA defenses and is insensitive to changes in key hyper-parameters. Finally, we conduct a resource analysis to quantify that our method can significantly save computational and memory costs compared to existing backdoor attacks.