From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Authors: Muhammad Bilal Shaikh, Syed Mohammed Shamsul Islam, Douglas Chai, Naveed Akhtar

Categories: cs.CV

Published: 2024-05-22

Comments: 23 pages, 5 figures and 3 tables. To appear in ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 2024

DOI: 10.1145/3664815

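💡 One-sentence takeaway

A survey of the shift from CNNs to Transformers in multimodal human action recognition, with a focus on fusion strategies

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: human action recognition, multimodal learning, convolutional neural networks, Transformer, feature fusion, deep learning, computer vision, behavior analysis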

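📋 Key points

Existing human action recognition methods still leave room for improvement in how they fuse features from multimodal data, and they struggle to fully exploit the information in each modality.
This survey focuses on multimodal human action recognition, analyzing CNN-based and Transformer-based architecture designs in depth, with particular attention to feature fusion strategies.
By analyzing and summarizing existing methods, the survey aims to give researchers in multimodal human action recognition a useful reference and to advance the field.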

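📝 Abstract (translated)

Human action recognition is a widely studied problem in computer vision with a broad range of applications. Recent studies show that using multimodal data yields better performance than relying on a single data modality. Over the past decade of deep learning for visual modelling, action recognition methods have relied mainly on convolutional neural networks (CNNs). The rise of Transformers in visual modelling, however, is now causing a paradigm shift in the action recognition task as well. This survey captures that shift, focusing on Multimodal Human Action Recognition (MHAR). What sets multimodal computational models apart is the process of fusing the features of the individual data modalities, so the survey pays particular attention to fusion design in MHAR methods. It analyzes the classic and emerging techniques in this area and highlights popular trends in how CNN and Transformer building blocks are used to solve the overall problem. In particular, it emphasizes recent design choices that have led to more efficient MHAR models. Unlike existing reviews that discuss human action recognition from a broad perspective, this survey aims to push the boundaries of MHAR research by identifying promising architectural and fusion design choices for training practicable models. It also provides an outlook on multimodal datasets in terms of their scale and evaluation. Finally, building on the reviewed literature, it discusses the challenges and future directions for MHAR.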

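🔬 Method details

Problem definition: Human action recognition aims to understand human activities in video or sensor data. When handling multimodal data, existing methods face the challenge of fusing features from different modalities effectively. Traditional methods may simply concatenate or sum the features, which fails to capture the complex relationships between modalities and limits recognition accuracy. Designing efficient MHAR models is a further open problem.

Core idea: The survey analyzes and summarizes existing multimodal human action recognition methods, especially those based on CNNs and Transformers, with a focus on feature fusion strategies. By comparing the strengths and weaknesses of different methods, it offers researchers guidance for designing more effective MHAR models. It emphasizes recent design choices that have led to more efficient MHAR models.

Technical framework: The paper is a survey and does not propose a new technical framework. It first introduces the basic concepts and challenges of human action recognition, then discusses CNN-based and Transformer-based MHAR methods in turn. It analyzes the feature extraction methods and fusion strategies used for different modalities, such as early fusion, late fusion, and attention mechanisms. Finally, it summarizes existing multimodal datasets and outlines future research directions.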
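The difference between early and late fusion can be made concrete with a small sketch. The example below is an illustration only and is not taken from the survey: the two modalities (RGB and skeleton), the feature values, the weight matrices, and the helper names `linear`, `early_fusion`, and `late_fusion` are all hypothetical. Early fusion concatenates the per-modality features before a single classifier; late fusion runs one classifier per modality and mixes the resulting class probabilities.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, features):
    """Linear classifier without bias: one row of weights per class."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def early_fusion(rgb_feat, skel_feat, joint_weights):
    """Concatenate modality features, then classify once."""
    fused = rgb_feat + skel_feat  # list concatenation
    return softmax(linear(joint_weights, fused))


def late_fusion(rgb_feat, skel_feat, rgb_weights, skel_weights, alpha=0.5):
    """Classify each modality separately, then mix the probabilities."""
    p_rgb = softmax(linear(rgb_weights, rgb_feat))
    p_skel = softmax(linear(skel_weights, skel_feat))
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_rgb, p_skel)]


# Hypothetical toy inputs: 3-dim RGB feature, 2-dim skeleton feature, 2 classes.
rgb_feat = [0.2, 0.7, 0.1]
skel_feat = [0.9, 0.4]
joint_weights = [[0.5, -0.2, 0.1, 0.3, 0.0], [-0.1, 0.4, 0.2, -0.3, 0.6]]
rgb_weights = [[0.5, -0.2, 0.1], [-0.1, 0.4, 0.2]]
skel_weights = [[0.3, 0.0], [-0.3, 0.6]]

p_early = early_fusion(rgb_feat, skel_feat, joint_weights)
p_late = late_fusion(rgb_feat, skel_feat, rgb_weights, skel_weights)
print("early fusion:", p_early)
print("late fusion: ", p_late)
```

In this sketch the early-fusion classifier can weight cross-modal feature combinations directly, while the late-fusion variant keeps the modalities independent until the final mixing step, which makes it easier to drop or reweight a modality through `alpha`.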

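Key innovation: The survey's contribution is its systematic summary and analysis of multimodal human action recognition. Unlike existing reviews, it pays closer attention to feature fusion strategies and examines in depth how CNNs and Transformers are applied in MHAR. By comparing and analyzing existing methods, it gives researchers guidance for designing more effective MHAR models.

Key design: The survey does not propose a new algorithm or model, so it has no parameter settings, loss function, or network structure of its own. It does, however, summarize the key design choices of existing methods, for example: feature extraction methods for different modalities (such as 3D CNNs for video features and LSTMs for time-series features), fusion strategies (such as early fusion, late fusion, and attention mechanisms), and loss functions (such as cross-entropy loss and contrastive loss).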
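As an illustration of attention-based fusion, the following minimal sketch applies scaled dot-product cross-attention between two modalities. It is an assumption-laden toy, not a method described in the survey: the function name `cross_attention`, the token values, and the choice of RGB tokens as queries and skeleton tokens as keys and values are hypothetical, and the learned query, key, and value projections used in real Transformer blocks are omitted for brevity.

```python
import math


def softmax(scores):
    """Turn raw scores into attention weights (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from one modality onto another.

    queries: tokens of the attending modality, each of dimension d
    keys, values: tokens of the attended modality (keys have dimension d)
    Returns one fused vector per query token.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        fused.append(out)
    return fused


# Hypothetical toy tokens: 2 RGB tokens attend over 3 skeleton tokens.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0]]
skel_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused_tokens = cross_attention(rgb_tokens, skel_tokens, skel_tokens)
for token in fused_tokens:
    print(token)
```

Each output row is a convex combination of the skeleton tokens, weighted by how similar they are to the corresponding RGB token, so the fused representation adapts per token instead of applying one fixed concatenation or sum.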

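📊 Experimental highlights

This is a survey article and reports no experimental results of its own. It does, however, give a comprehensive summary and analysis of existing MHAR methods and points out directions for future research. Readers can use it to get up to speed on progress in MHAR and to identify research directions of interest.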

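🎯 Application scenarios

Multimodal human action recognition has broad application prospects in video surveillance, human-computer interaction, smart homes, and healthcare. For example, in video surveillance it can be used to detect abnormal behavior; in human-computer interaction, to understand user intent; in smart homes, to provide personalized services; and in healthcare, to monitor patients' health. As the technology matures, MHAR is expected to play an important role in more domains.

📄 Abstract (original)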

Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance as compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the last decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of "fusing" the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaption of CNN and Transformer building blocks for the overall problem. In particular, we emphasize on recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an outlook of the multimodal datasets from their scale and evaluation viewpoint. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.
