Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

📄 arXiv: 2606.10571v1

Authors: Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen, Yifei Huang, Bo Liu

Categories: cs.CV, cs.AI, cs.CR

Published: 2026-06-09

Comments: 17 pages, 7 figures, 10 tables


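💡 One-Sentence Takeaway

Proposes DeBias-Attack, which improves the transferability of adversarial examples across VLP models by correcting surrogate-specific bias.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: adversarial attack, vision-language models, robustness improvement, multimodal learning, gradient correction, model security, deep learning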

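📋 Key Points

  1. Existing adversarial attacks on vision-language pre-training (VLP) models rely heavily on the surrogate model, so their performance drops when the adversarial examples are transferred to other models.
  2. DeBias-Attack corrects surrogate-specific bias in the adversarial optimization direction by optimizing two perturbation branches: a main branch and a reference branch.
  3. Experiments report strong performance across multiple VLP models and downstream tasks.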

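📝 Summary (translated from the Chinese)

Adversarial examples reveal vulnerabilities in vision-language pre-training (VLP) models and provide insights for improving their robustness. Existing attacks rely heavily on the surrogate model, which causes performance to drop across models. DeBias-Attack improves the transferability of adversarial examples by correcting surrogate-specific bias in the adversarial optimization direction. The method maintains two perturbation branches: the main branch optimizes a perturbation on the original image, and the reference branch optimizes a perturbation on a weak-semantic image. Experiments show strong performance across multiple VLP models and downstream tasks.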

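🔬 Method Details

Problem definition: The paper addresses the limited transferability of adversarial examples on vision-language pre-training models. Existing methods depend too heavily on the surrogate model's responses, so the adversarial optimization direction follows the surrogate rather than the input semantics. The resulting updates work on the surrogate but transfer poorly to unseen target models.

Core idea: DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization. It maintains two perturbation branches, one on the original image and one on a weak-semantic image. Because the weak-semantic image contains little clear visual content, its gradient reflects surrogate responses more than image semantics and serves as an estimate of the bias.

Technical framework: The architecture consists of a main branch and a reference branch. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image, which is constructed from the dataset mean image with small Gaussian noise resampled at each iteration.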
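The construction of the weak-semantic reference image can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `dataset_mean_image` and `weak_semantic_image`, the noise scale `sigma`, and the assumption that pixels lie in [0, 1] are all hypothetical choices, since the summary does not specify them.

```python
import numpy as np

def dataset_mean_image(images):
    """Pixel-wise mean over a batch of images shaped (N, H, W, C)."""
    return np.mean(np.asarray(images, dtype=np.float64), axis=0)

def weak_semantic_image(mean_image, sigma=0.01, rng=None):
    """Mean image plus small Gaussian noise, clipped to the [0, 1] pixel range.

    sigma is an assumed value; the source only says the noise is "small".
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=mean_image.shape)
    return np.clip(mean_image + noise, 0.0, 1.0)

# Usage: the noise is resampled at every attack iteration,
# so each call yields a slightly different reference image.
rng = np.random.default_rng(0)
images = rng.random((8, 4, 4, 3))      # stand-in for a real dataset
mean_img = dataset_mean_image(images)
ref_images = [weak_semantic_image(mean_img, rng=rng) for _ in range(3)]
```

Calling `weak_semantic_image` inside the attack loop, rather than once up front, is what gives the per-iteration resampling described above.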

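Key innovation: According to the authors, DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. The correction is intended to make the update direction depend less on the surrogate model, which improves transferability across models.

Key design: Before updating the adversarial image, DeBias-Attack removes the aligned projection of the main gradient on the reference gradient, so that the update is less tied to the surrogate's own responses. It then performs context-aware text substitution using the updated adversarial image.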
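The gradient correction can be sketched as below. This is an illustration under stated assumptions rather than the paper's exact rule: "aligned projection" is interpreted here as the projection that is removed only when the main and reference gradients point in a similar direction (positive inner product), and the sign-based update with step size `alpha` and L-infinity budget `budget` is a standard PGD-style step assumed for completeness. The function names are hypothetical.

```python
import numpy as np

def remove_aligned_projection(g_main, g_ref, eps=1e-12):
    """Subtract from g_main its projection onto g_ref when the two are aligned.

    Assumption: "aligned" means a positive inner product; otherwise
    the main gradient is returned unchanged.
    """
    g = g_main.ravel()
    r = g_ref.ravel()
    dot = float(g @ r)
    if dot <= 0.0:
        return g_main.copy()
    coef = dot / (float(r @ r) + eps)   # eps guards against a zero reference gradient
    return g_main - coef * g_ref

def signed_step(x_adv, x_orig, grad, alpha=2 / 255, budget=8 / 255):
    """Assumed PGD-style update: sign step, then projection onto the L-inf ball."""
    x_new = x_adv + alpha * np.sign(grad)
    x_new = np.clip(x_new, x_orig - budget, x_orig + budget)
    return np.clip(x_new, 0.0, 1.0)

# Usage with toy gradients in place of real surrogate-model gradients.
g_main = np.array([[1.0, 1.0], [0.5, -0.5]])
g_ref = np.array([[1.0, 0.0], [0.0, 0.0]])
g_corr = remove_aligned_projection(g_main, g_ref)
x = np.full((2, 2), 0.5)
x_adv = signed_step(x, x, g_corr)
```

After the correction, the component of the main gradient along the reference direction is close to zero, so the image update carries little of the direction that the weak-semantic branch attributes to the surrogate.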


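📊 Experimental Highlights

The experiments show that DeBias-Attack performs strongly across multiple vision-language pre-training models. On open-source and closed-source multimodal large language models in particular, the reported improvement exceeds 20%.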

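🎯 Application Scenarios

The work is relevant to adversarial attack research and to improving model robustness, particularly the security and reliability of multimodal large language models. DeBias-Attack could be used to evaluate and harden image-text systems against transfer-based attacks.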

📄 Abstract (Original)

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.