Provable Robustness in Multimodal Large Language Models via Feature Space Smoothing
Authors: Song Xia, Meiwen Ding, Chenqi Kong, Wenhan Yang, Xudong Jiang
Categories: cs.LG, cs.CV
Published: 2026-01-22
Comments: Under review
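💡 One-sentence takeaway
Proposes Feature-space Smoothing (FS) to give multimodal large language models certified robustness against adversarial perturbations.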
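🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: multimodal large language models, feature-space smoothing, adversarial attacks, robustness enhancement, purifier, Gaussian robustness, feature cosine similarity, machine learning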
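📋 Key points
- Existing multimodal large language models are fragile under adversarial attacks: small perturbations distort their feature representations and lead to wrong predictions.
- The paper proposes Feature-space Smoothing (FS), which converts a feature encoder into a smoothed variant with certified robustness, and adds the Purifier and Smoothness Mapper (PSM) module to strengthen that guarantee further.
- Experiments show that FS combined with PSM reduces the attack success rate of white-box attacks from nearly 90% to about 1% and outperforms adversarial training empirically.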
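📝 Abstract (summary)
Multimodal large language models (MLLMs) show strong capabilities across many applications, yet they remain vulnerable to adversarial perturbations that distort their feature representations and cause erroneous predictions. To address this vulnerability, the paper proposes Feature-space Smoothing (FS) and proves theoretically that FS provides certified robustness on the feature representations of MLLMs. FS transforms any feature encoder into a smoothed variant that guarantees a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. The paper also introduces the Purifier and Smoothness Mapper (PSM), a plug-and-play module that raises the Gaussian robustness score of MLLMs, and thereby their certified robustness under FS, without retraining the MLLM. Experiments show that FS combined with PSM reduces the attack success rate from nearly 90% to about 1%.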
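🔬 Method details
Problem definition: The paper targets the fragility of multimodal large language models under adversarial attacks. Existing methods do not effectively guarantee the robustness of the feature representations, which leaves the models open to attack.
Core idea: Feature-space Smoothing (FS) smooths the feature encoder so that the cosine similarity between clean and adversarial features has a certified lower bound under attack, which strengthens the robustness of the model.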
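The smoothing step can be illustrated with a small Monte Carlo sketch. It assumes, by analogy with randomized smoothing, that the smoothed encoder is the expectation of unit-normalized features under Gaussian input noise; the summary does not give the paper's exact construction, so this is only one plausible reading. `toy_encoder`, `smoothed_encoder`, `sigma`, and `n_samples` are hypothetical names and values chosen for illustration.

```python
import math
import random

def toy_encoder(x):
    """Hypothetical stand-in for a feature encoder: a fixed map from R^3 to R^2."""
    return [math.tanh(x[0] + 0.5 * x[1]), math.tanh(x[1] - 0.3 * x[2])]

def l2_normalize(v):
    """Scale a vector to unit l2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def smoothed_encoder(encoder, x, sigma=0.25, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[normalize(encoder(x + eps))], eps ~ N(0, sigma^2 I).

    Assumption: averaging unit-normalized features is an illustrative choice,
    not a construction confirmed by the source.
    """
    rng = random.Random(seed)
    acc = [0.0] * len(encoder(x))
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        feat = l2_normalize(encoder(noisy))
        acc = [a + f for a, f in zip(acc, feat)]
    return [a / n_samples for a in acc]

x_clean = [0.8, -0.4, 0.3]
delta = [0.05, -0.05, 0.05]  # a small l2-bounded perturbation (illustrative)
x_adv = [xi + di for xi, di in zip(x_clean, delta)]

clean_feat = smoothed_encoder(toy_encoder, x_clean)
adv_feat = smoothed_encoder(toy_encoder, x_adv)
similarity = cosine_similarity(clean_feat, adv_feat)
print(f"cosine similarity of smoothed features: {similarity:.4f}")
```

The quantity printed at the end is the kind of clean-versus-adversarial feature cosine similarity that FS is said to bound from below; the sketch only measures it empirically and does not compute any certificate.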
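Technical framework: The overall architecture consists of the Feature-space Smoothing (FS) module and the Purifier and Smoothness Mapper (PSM). FS smooths the feature encoder, while PSM raises the Gaussian robustness score without any retraining of the MLLM.
Key innovation: FS provides a theoretical robustness guarantee, and PSM strengthens existing models as a plug-and-play module, in contrast to conventional adversarial training.
Key design: FS defines a Gaussian robustness score on the vanilla encoder. Enlarging this score improves the Feature Cosine Similarity Bound (FCSB), the certified lower bound that holds under $\ell_2$-bounded attacks.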
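The summary does not define the Gaussian robustness score, so the following sketch uses a hypothetical proxy: the expected cosine similarity between the encoder's features on a clean input and on Gaussian-noised copies of it. It also uses a toy "purifier" (projection onto a known direction) purely to show how a plug-in placed in front of a frozen encoder could raise such a score; it is not the PSM module from the paper. All function names, the 4-dimensional inputs, and `sigma` are assumptions.

```python
import math
import random

# Assumption: toy clean inputs are multiples of this unit vector.
DIRECTION = [0.5, 0.5, 0.5, 0.5]

def toy_encoder(x):
    """Hypothetical frozen encoder: element-wise saturating nonlinearity."""
    return [math.tanh(2.0 * xi) for xi in x]

def toy_purifier(x):
    """Stand-in purifier: project the input onto DIRECTION, discarding off-axis noise."""
    coeff = sum(xi * di for xi, di in zip(x, DIRECTION))
    return [coeff * di for di in DIRECTION]

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gaussian_robustness_score(encoder, x, sigma=0.5, n_samples=2000, seed=0):
    """Hypothetical proxy: mean cosine similarity between encoder(x) and
    encoder(x + eps) for eps ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    clean_feat = encoder(x)
    total = 0.0
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total += cosine_similarity(clean_feat, encoder(noisy))
    return total / n_samples

def purified_encoder(x):
    """Frozen encoder with the purifier plugged in front; no retraining involved."""
    return toy_encoder(toy_purifier(x))

x_clean = [1.0, 1.0, 1.0, 1.0]
raw_score = gaussian_robustness_score(toy_encoder, x_clean)
purified_score = gaussian_robustness_score(purified_encoder, x_clean)
print(f"score without purifier: {raw_score:.3f}")
print(f"score with purifier:    {purified_score:.3f}")
```

In this toy setting the purified encoder scores higher because the projection removes the noise components the encoder would otherwise amplify, which mirrors the stated role of PSM: improve the score on the vanilla encoder so that the bound obtained under FS becomes stronger.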
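📊 Experimental highlights
With FS and PSM combined, the attack success rate under a range of white-box attacks drops from nearly 90% to about 1%. The method pairs a certified robustness guarantee with empirical performance that exceeds adversarial training.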
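For readers unfamiliar with the metric, here is a minimal sketch of how an attack success rate can be computed. It assumes an untargeted setting in which an attack "succeeds" when a sample the model handles correctly on the clean input becomes incorrect after perturbation; the paper may define ASR differently (for example, for targeted attacks). The counts below are made up to mirror the reported magnitudes and are not data from the paper.

```python
def attack_success_rate(clean_correct, adv_correct):
    """Fraction of clean-correct samples that become incorrect under attack.

    Assumption: untargeted definition; samples that were already wrong on the
    clean input are excluded from the denominator.
    """
    attacked = [adv for clean, adv in zip(clean_correct, adv_correct) if clean]
    if not attacked:
        return 0.0
    return sum(1 for adv in attacked if not adv) / len(attacked)

# Illustrative counts only: 100 clean-correct samples in each setting.
clean = [True] * 100
undefended_adv = [False] * 90 + [True] * 10   # 90 of 100 flipped by the attack
defended_adv = [False] * 1 + [True] * 99      # 1 of 100 flipped by the attack

asr_undefended = attack_success_rate(clean, undefended_adv)
asr_defended = attack_success_rate(clean, defended_adv)
print(asr_undefended)  # 0.9
print(asr_defended)    # 0.01
```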
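🎯 Application scenarios
Potential applications span multimodal tasks that combine natural language processing and computer vision, where the method can improve the stability and reliability of models operating under adversarial conditions.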
📄 Abstract (original)
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications, yet remain vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. To address this vulnerability, we propose the Feature-space Smoothing (FS) and theoretically prove that FS offers certified robustness on the feature representations of MLLMs. Specifically, FS transforms any feature encoder into a smoothed variant that is guaranteed to maintain a certified lower bound on the feature cosine similarity between clean and adversarial representations under $\ell_2$-bounded attacks. Moreover, we indicate that the value of this Feature Cosine Similarity Bound (FCSB) derived from FS can be improved by enlarging the defined Gaussian robustness score on the vanilla encoder. Building upon this, we introduce the Purifier and Smoothness Mapper (PSM), a plug-and-play module that improves the Gaussian robustness score of MLLMs and thus enhances their certified robustness under FS, without requiring any retraining on MLLMs. We demonstrate that the FS with PSM not only provides a strong theoretical robustness guarantee but also exhibits superior empirical performance compared to adversarial training. Extensive experiments across diverse MLLMs and downstream tasks indicate the effectiveness of the FS-PSM, reducing the Attack Success Rate (ASR) of various white-box attacks from nearly 90\% to about 1\%.