Harnessing Large Language Models for Training-free Video Anomaly Detection

Authors: Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, Elisa Ricci

Category: cs.CV

Published: 2024-04-01

Comments: CVPR 2024. Project website at https://lucazanella.github.io/lavad/

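💡 One-sentence takeaway

Proposes LAVAD, a training-free method that removes video anomaly detection's dependence on model training.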

🎯 Matched area: Pillar 9: Embodied Foundation Models

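Keywords: video anomaly detection, training-free, large language models, vision-language models, cross-modal similarity, caption generation, anomaly score estimation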

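📋 Key points

Existing video anomaly detection methods rely on training deep models, which makes them domain-specific and costly to deploy.
The paper proposes LAVAD, which performs anomaly detection without any training by using pre-trained large language models and vision-language models.
On the UCF-Crime and XD-Violence datasets, LAVAD outperforms unsupervised and one-class methods.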

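📝 Abstract (translated summary)

Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing methods mostly train deep models to learn the distribution of normality. Such methods tend to be domain-specific, so any domain change requires new data collection and model training, which makes practical deployment costly. The paper proposes LAVAD, a training-free VAD method built on pre-trained large language models (LLMs) and existing vision-language models (VLMs). VLM-based captioning models generate a textual description for each video frame. Given these descriptions, a prompting mechanism has the LLM perform temporal aggregation and anomaly score estimation, which turns it into a video anomaly detector. Modality-aligned VLMs are further used, through cross-modal similarity, to clean noisy captions and refine the LLM-based anomaly scores. On two large datasets of real-world surveillance scenarios, LAVAD outperforms unsupervised and one-class methods without requiring any training or data collection.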

🔬 Method details

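Problem definition: The paper addresses the training dependency of video anomaly detection. Existing methods need new data collection and retraining whenever the domain changes, which makes deployment costly and adaptation slow.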

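Core idea: LAVAD generates a textual description of each video frame and uses a large language model to estimate anomaly scores from those descriptions, avoiding the training stage of conventional methods.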
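To make the core idea concrete, the sketch below shows one way captions could be turned into an LLM prompt and the reply turned back into a score. The prompt wording, the 0-to-1 scale, and the helper names `build_prompt` and `parse_score` are assumptions made for illustration; they are not taken from the paper.

```python
import re


def build_prompt(captions):
    """Join per-frame captions into one scoring prompt (wording is hypothetical)."""
    scene = " ".join(c.strip() for c in captions if c.strip())
    return (
        "Rate how anomalous the following scene is on a scale from 0 "
        "(normal) to 1 (clearly anomalous). Reply with one number.\n"
        f"Scene: {scene}"
    )


def parse_score(reply, default=0.0):
    """Extract the first number in the LLM reply and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        # The reply contained no number, so fall back to the default score.
        return default
    return min(1.0, max(0.0, float(match.group())))
```

The prompt string would be sent to whichever LLM is available, and `parse_score` guards against replies that contain no number or a number outside the expected range.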

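Technical framework: The pipeline consists of caption generation for each video frame, a prompting mechanism that has the LLM aggregate captions over time and estimate anomaly scores, and cross-modal-similarity techniques that clean noisy captions and refine the scores.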
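The stages of this framework can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: `caption_frame` and `score_text` are hypothetical callables standing in for a VLM captioner and an LLM scorer, the fixed-radius window is one simple form of temporal aggregation, and the cleaning and refinement stage is left out.

```python
from typing import Callable, List, Sequence


def temporal_windows(n_frames: int, radius: int) -> List[range]:
    """For each frame index, the indices of the frames within `radius` of it."""
    return [
        range(max(0, i - radius), min(n_frames, i + radius + 1))
        for i in range(n_frames)
    ]


def score_video(
    frames: Sequence[object],
    caption_frame: Callable[[object], str],  # hypothetical VLM captioner
    score_text: Callable[[str], float],      # hypothetical LLM scorer
    radius: int = 2,
) -> List[float]:
    """Caption every frame, aggregate captions per window, score each window."""
    captions = [caption_frame(f) for f in frames]
    scores = []
    for window in temporal_windows(len(frames), radius):
        # Temporal aggregation: the scorer sees the captions of nearby frames.
        context = " ".join(captions[j] for j in window)
        scores.append(score_text(context))
    return scores
```

Passing the models in as callables keeps the sketch independent of any particular captioning or language model, so simple stubs are enough to exercise it.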

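Key innovation: LAVAD applies large language models to video anomaly detection in a training-free paradigm, removing the data collection and model training that earlier methods require.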

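Key design: The main design choices are using vision-language models to generate frame captions, designing a prompting mechanism that draws on the LLM's ability to aggregate over time and estimate anomaly scores, and using cross-modal similarity to clean noisy captions and refine the anomaly scores.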
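The cross-modal steps can be illustrated with plain cosine similarity over embeddings. The sketch below is one plausible reading and not the authors' implementation: it assumes frame and text embeddings already live in a shared space (as produced by a modality-aligned VLM), replaces each frame's caption with the best-matching caption in the video, and smooths scores with a softmax-weighted average. The function names and the `temperature` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clean_captions(frame_embs, caption_embs, captions):
    """For each frame, keep the caption whose embedding is most similar to it."""
    cleaned = []
    for f in frame_embs:
        best = max(range(len(captions)), key=lambda k: cosine(f, caption_embs[k]))
        cleaned.append(captions[best])
    return cleaned


def refine_scores(text_embs, frame_embs, scores, temperature=0.1):
    """Replace each score by a similarity-weighted average over all frames."""
    refined = []
    for t in text_embs:
        sims = [cosine(t, f) / temperature for f in frame_embs]
        peak = max(sims)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in sims]
        total = sum(weights)
        refined.append(sum(w * s for w, s in zip(weights, scores)) / total)
    return refined
```

In `clean_captions`, a caption that poorly matches its own frame is replaced by a better-matching caption from elsewhere in the video. In `refine_scores`, a lower `temperature` concentrates the average on the most similar frames.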

📊 Experimental highlights

On the UCF-Crime and XD-Violence datasets, LAVAD outperforms unsupervised and one-class methods while requiring no training or data collection.

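🎯 Application scenarios

The approach has potential applications in video surveillance, public safety, and intelligent transportation. Because it requires no training data, LAVAD can be deployed in a new scenario without collecting data or retraining a model.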

📄 Abstract (original)

Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision, one-class supervision, or in an unsupervised setting. Training-based methods are prone to be domain-specific, thus being costly for practical deployment as any domain change will involve data collection and model training. In this paper, we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm, exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description, we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation, turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence), showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection.
