Dissecting Adversarial Robustness of Multimodal LM Agents

📄 arXiv: 2406.12814v3 📥 PDF

Authors: Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan

Categories: cs.LG, cs.CL, cs.CR, cs.CV

Published: 2024-06-18 (updated: 2025-02-04)

Comments: ICLR 2025. Also an oral at the NeurIPS 2024 Open-World Agents Workshop

🔗 Code/Project: GITHUB


💡 One-Sentence Takeaway

Proposes the Agent Robustness Evaluation (ARE) framework to systematically analyze the adversarial robustness of multimodal language model agents.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: adversarial robustness, multimodal language models, autonomous agents, safety evaluation, graph-based analysis

📋 Key Points

  1. When language models are used to build autonomous agents, their adversarial robustness in complex real environments has not been adequately evaluated, leaving a safety gap.
  2. This paper proposes the Agent Robustness Evaluation (ARE) framework, which views the agent as a graph of intermediate outputs flowing between components and decomposes robustness along that graph.
  3. Experiments show that imperceptible perturbations to a single image can hijack state-of-the-art agents with high attack success rates, and that adding new components can introduce new vulnerabilities.

📝 Abstract (Translated)

As language models (LMs) are increasingly used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Existing safety evaluation methods do not adequately account for the complexity of agents as compound systems. To address this, the authors manually create 200 targeted adversarial tasks and evaluate them in VisualWebArena, a realistic web environment. They propose the Agent Robustness Evaluation (ARE) framework, which views the agent as a graph, analyzes the flow of intermediate outputs between components, and thereby decomposes robustness. The study finds that with imperceptible perturbations to a single image, an attacker can hijack the latest agents and make them execute targeted adversarial goals with success rates up to 67%.

🔬 Method Details

Problem definition: This paper addresses the adversarial robustness of multimodal language model agents in real environments. Existing methods do not adequately account for the compound-system nature of agents, leading to insufficient safety evaluation.

Core idea: Propose the Agent Robustness Evaluation (ARE) framework, which treats the agent as a graph, analyzes the flow of intermediate outputs between components, and decomposes robustness, enabling a systematic evaluation of the agent under attack.
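The graph view above can be sketched with a minimal data structure. This is not the paper's implementation; the component names (`captioner`, `lm_policy`, etc.) are hypothetical, and the sketch only illustrates the idea that robustness analysis reduces to tracing how adversarial information can propagate along edges.

```python
# Minimal sketch (assumed component names): an agent as a directed graph
# of components; an adversarial input is analyzed by tracing which
# downstream components its intermediate outputs can reach.
from collections import defaultdict, deque

class AgentGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # component -> downstream components

    def add_flow(self, src, dst):
        """Record that src's intermediate output feeds into dst."""
        self.edges[src].append(dst)

    def reachable_from(self, entry):
        """All components an adversarial input at `entry` can influence (BFS)."""
        seen, queue = {entry}, deque([entry])
        while queue:
            node = queue.popleft()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

# Hypothetical agent: a captioner reads the page image and feeds an LM
# policy, which emits the action.
g = AgentGraph()
g.add_flow("webpage_image", "captioner")
g.add_flow("captioner", "lm_policy")
g.add_flow("lm_policy", "action")
print(g.reachable_from("webpage_image"))
```

A perturbation on `webpage_image` reaches every component here, which is why a single attacked image can ultimately control the action.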

Technical framework: The ARE framework consists of several stages: first, create targeted adversarial tasks; then analyze the flow of outputs through the component graph; and finally, evaluate how robustness changes.

Key innovation: The central innovation is treating the agent as a graph, which enables a more complete analysis of how adversarial information flows, in contrast to traditional evaluations of single components in isolation.

Key design: The experiments use 200 targeted adversarial tasks and attack via imperceptible perturbations confined to a single image (less than 5% of the web page's pixels), evaluating how different components affect robustness.
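The threat model's pixel budget can be sketched as follows. This is a simplified illustration, not the paper's attack: the L∞ bound `eps` is an assumed value, and the "page" is a toy array; the sketch only shows the two constraints, that the attacked region stays under 5% of the page's pixels and that per-pixel changes stay within a small budget.

```python
# Sketch (assumed parameters): apply an L_inf-bounded perturbation to one
# trigger image occupying under 5% of the rendered page's pixels.
import numpy as np

def apply_bounded_patch(page, patch_box, delta, eps=8 / 255):
    """Add a clipped perturbation `delta` to the region `patch_box` of `page`.

    page: float array in [0, 1], shape (H, W, 3)
    patch_box: (top, left, height, width) of the attacked image
    delta: raw perturbation of shape (height, width, 3)
    eps: per-pixel L_inf budget (assumed value, for imperceptibility)
    """
    top, left, h, w = patch_box
    # Threat-model constraint: attacked image < 5% of total page pixels.
    assert h * w <= 0.05 * page.shape[0] * page.shape[1], "patch exceeds 5% of page"
    out = page.copy()
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = np.clip(
        region + np.clip(delta, -eps, eps), 0.0, 1.0
    )
    return out

page = np.zeros((100, 100, 3))      # toy 100x100 "web page"
delta = np.full((20, 20, 3), 0.5)   # oversized raw perturbation
attacked = apply_bounded_patch(page, (10, 10, 20, 20), delta)
print(attacked.max())               # capped at the eps budget
```

In a real attack, `delta` would be optimized (e.g., by gradient-based methods) to steer the agent toward the adversarial goal; the clipping here only enforces the budget.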

📊 Experimental Highlights

Experiments show that by applying imperceptible perturbations to a single image, an attacker can hijack the latest agents to execute targeted goals with success rates up to 67%. Moreover, adding new components can harm robustness: compromising the evaluator used by the reflexion agent and the value function of the tree-search agent increases attack success relatively by 15% and 20%, respectively.
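The two reported metrics are simple to state precisely. The sketch below uses toy outcome data (the 134/200 split is constructed to reproduce a 67% rate, not taken from the paper); only the metric definitions, success fraction and relative increase, reflect how the numbers above are to be read.

```python
# Sketch of the reported metrics: attack success rate (ASR) over the
# targeted adversarial tasks, and the *relative* increase in ASR when an
# added component (evaluator / value function) is compromised.
def attack_success_rate(outcomes):
    """Fraction of adversarial tasks where the targeted goal was executed."""
    return sum(outcomes) / len(outcomes)

def relative_increase(base_asr, new_asr):
    """Relative (not absolute) change in ASR after attacking a new component."""
    return (new_asr - base_asr) / base_asr

# Toy outcomes: 134 of 200 targeted tasks succeed -> 67% ASR.
outcomes = [1] * 134 + [0] * 66
print(f"ASR: {attack_success_rate(outcomes):.0%}")

# E.g. an ASR rising from 0.50 to 0.60 is a 20% *relative* increase,
# even though the absolute gain is only 10 percentage points.
print(f"relative increase: {relative_increase(0.50, 0.60):.0%}")
```

Reading the 15% and 20% figures as relative increases matters: they describe the gain over the agent's baseline attack success, not additional percentage points.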

🎯 Application Scenarios

Potential applications of this work include safety evaluation of autonomous agents, defending intelligent assistants against adversarial attacks, and improving the robustness of multimodal systems. Stronger adversarial robustness makes agents more reliable and safer in real environments, giving the work significant practical value and future impact.

📄 Abstract (Original)

As language models (LMs) are used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Unlike chatbots, agents are compound systems with multiple components taking actions, which existing LMs safety evaluations do not adequately address. To bridge this gap, we manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena, a real environment for web agents. To systematically examine the robustness of agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness as the flow of adversarial information on the graph. We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search. With imperceptible perturbations to a single image (less than 5% of total web page pixels), an attacker can hijack these agents to execute targeted adversarial goals with success rates up to 67%. We also use ARE to rigorously evaluate how the robustness changes as new components are added. We find that inference-time compute that typically improves benign performance can open up new vulnerabilities and harm robustness. An attacker can compromise the evaluator used by the reflexion agent and the value function of the tree search agent, which increases the attack success relatively by 15% and 20%. Our data and code for attacks, defenses, and evaluation are at https://github.com/ChenWu98/agent-attack