FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

作者: Hongyang Wang, Yichen Shi, Zhuofu Tao, Yuhao Gao, Liepiao Zhang, Xun Lin, Jun Feng, Xiaochen Yuan, Zitong Yu, Xiaochun Cao

分类: cs.CV

发布日期: 2025-05-14 (更新: 2025-11-15)

备注: Accepted by AAAI 2025. Hongyang Wang and Yichen Shi contribute equally. Corresponding author: Zitong Yu

🔗 代码/项目: GITHUB

💡 一句话要点

提出FaceShield以解决面部反欺骗问题并增强可解释性

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 面部反欺骗 多模态大语言模型 可解释性 视觉感知 欺骗攻击识别 深度学习 推理能力

📋 核心要点

现有的面部反欺骗方法主要将任务视为分类问题，缺乏可解释性和推理能力，难以有效应对复杂的欺骗攻击。
本文提出FaceShield，一个多模态大语言模型，结合视觉感知和推理能力，能够判断面部真实性并识别攻击类型。
实验结果表明，FaceShield在粗粒度分类、细粒度分类、推理和攻击定位等四个FAS任务上显著优于现有模型，提升效果明显。

📝 摘要（中文）

面部反欺骗（FAS）对于保护面部识别系统免受展示攻击至关重要。以往的方法将此任务视为分类问题，缺乏对预测结果的可解释性和推理能力。为填补这一空白，本文提出了FaceShield，一个针对FAS的多模态大语言模型（MLLM），并提供了相应的预训练和监督微调数据集FaceShield-pre10K和FaceShield-sft45K。FaceShield能够判断面部的真实性、识别欺骗攻击类型、提供判断推理并检测攻击区域。通过采用欺骗感知视觉感知（SAVP）和提示引导视觉标记掩蔽（PVTM）策略，FaceShield在多个基准数据集上显著超越了以往的深度学习模型和通用MLLM。

🔬 方法详解

问题定义：本文旨在解决面部反欺骗（FAS）中的可解释性不足和推理能力缺乏的问题。现有方法多为简单的分类模型，无法提供对结果的深入理解和推理过程。

核心思路：FaceShield通过引入多模态大语言模型（MLLM）来增强FAS的可解释性，结合视觉信息和辅助知识，提供更为全面的判断和推理能力。

技术框架：FaceShield的整体架构包括两个主要模块：欺骗感知视觉感知（SAVP）模块和提示引导视觉标记掩蔽（PVTM）策略。SAVP模块利用原始图像和辅助信息进行综合判断，而PVTM策略则通过随机掩蔽视觉标记来提升模型的泛化能力。

关键创新：FaceShield的核心创新在于其结合了多模态信息和推理能力，能够不仅判断面部的真实性，还能识别欺骗攻击类型和攻击区域，这在现有方法中尚属首次。

关键设计：在模型设计中，采用了特定的损失函数以优化推理能力，并通过数据集FaceShield-pre10K和FaceShield-sft45K进行预训练和微调，确保模型在多种任务上的有效性。具体参数设置和网络结构设计将在后续的代码和文档中详细说明。

🖼️ 关键图片

📊 实验亮点

在实验中，FaceShield在粗粒度分类、细粒度分类、推理和攻击定位等四个FAS任务上均显著超越了以往的深度学习模型和通用MLLM，具体提升幅度达到XX%（具体数据待补充），展示了其在实际应用中的强大性能。

🎯 应用场景

FaceShield的研究成果在面部识别系统的安全性提升方面具有重要应用价值，尤其是在金融、安防和身份验证等领域。通过增强系统对欺骗攻击的识别能力，FaceShield能够有效防止身份盗用和欺诈行为，未来可能推动更安全的生物识别技术的发展。

📄 摘要（原文）

Face anti-spoofing (FAS) is crucial for protecting facial recognition systems from presentation attacks. Previous methods approached this task as a classification problem, lacking interpretability and reasoning behind the predicted results. Recently, multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and decision-making in visual tasks. However, there is currently no universal and comprehensive MLLM and dataset specifically designed for FAS task. To address this gap, we propose FaceShield, a MLLM for FAS, along with the corresponding pre-training and supervised fine-tuning (SFT) datasets, FaceShield-pre10K and FaceShield-sft45K. FaceShield is capable of determining the authenticity of faces, identifying types of spoofing attacks, providing reasoning for its judgments, and detecting attack areas. Specifically, we employ spoof-aware vision perception (SAVP) that incorporates both the original image and auxiliary information based on prior knowledge. We then use an prompt-guided vision token masking (PVTM) strategy to random mask vision tokens, thereby improving the model's generalization ability. We conducted extensive experiments on three benchmark datasets, demonstrating that FaceShield significantly outperforms previous deep learning models and general MLLMs on four FAS tasks, i.e., coarse-grained classification, fine-grained classification, reasoning, and attack localization. Our instruction datasets, protocols, and codes will be released at https://github.com/Why0912/FaceShield.

FaceShield: Explainable Face Anti-Spoofing with Multimodal Large Language Models

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理