LiquidTAD: An Efficient Method for Temporal Action Detection via Liquid Neural Dynamics

Authors: Zepeng Sun, Naichuan Zheng, Hailun Xia, Junjie Wu, Liwei Bao, Xiaotai Zhang

Category: cs.CV

Published: 2026-04-20

💡 One-Sentence Summary

LiquidTAD: efficient temporal action detection using parallelized liquid neural dynamics

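🎯 Matched Area: Pillar 6: Video Extraction and Matching (Video Extraction)

Keywords: temporal action detection, liquid neural networks, parallel computation, parameter efficiency, video understanding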

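📋 Key Points

Transformers perform well on temporal action detection, but their high computational complexity and parameter redundancy limit deployment in resource-constrained environments.
LiquidTAD uses parallelized ActionLiquid blocks and a closed-form continuous-time formulation to substantially reduce computational complexity and parameter count while maintaining performance.
Experiments show that LiquidTAD achieves a competitive mAP on THUMOS-14 with 63% fewer parameters than ActionFormer, and reaches the best accuracy-efficiency trade-off on the other datasets evaluated.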

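📝 Abstract (Translated from Chinese)

This paper proposes LiquidTAD, a parameter-efficient framework for temporal action detection that addresses the high computational complexity and parameter redundancy of Transformer-based architectures. LiquidTAD replaces self-attention layers with parallelized ActionLiquid blocks. Unlike traditional Liquid Neural Networks (LNNs), which are bottlenecked by sequential execution, LiquidTAD uses a closed-form continuous-time (CfC) formulation to recast the model as a parallelizable operator while preserving the physical prior of continuous-time dynamics. The architecture captures complex temporal dependencies with O(N) linear complexity and adaptively modulates temporal sensitivity through learned time constants (τ), providing a robust mechanism for handling actions of varying duration. To the best of the authors' knowledge, this is the first work to introduce a parallelized LNN-based architecture to the TAD domain. Experiments on the THUMOS-14 dataset show that LiquidTAD achieves an average mAP of 69.46% with only 10.82M parameters, a 63% reduction compared to the ActionFormer baseline. Further evaluation on the ActivityNet-1.3 and Ego4D benchmarks confirms that LiquidTAD achieves an optimal accuracy-efficiency trade-off and shows superior robustness to changes in temporal sampling, advancing the Pareto frontier of modern TAD frameworks.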

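🔬 Method Details

Problem definition: Temporal action detection (TAD) aims to identify the start and end times of actions in untrimmed videos. Existing Transformer-based methods perform well, but their self-attention mechanism has quadratic computational complexity and redundant parameters, which makes them hard to deploy in resource-constrained environments.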
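To make the task output concrete: a TAD prediction is a (start, end) segment, and predictions are conventionally compared with ground truth by temporal IoU, the overlap measure that mAP in TAD is built on. The summary does not describe the evaluation protocol, so the following is only a minimal sketch of that standard measure; the function name `temporal_iou` and the example segments are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds.

    Hypothetical helper for illustration; assumes start <= end for both.
    """
    # Length of the overlap between the two intervals (0 if disjoint).
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union = sum of lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# A predicted segment that covers half of a 10-second ground-truth action.
score = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

A prediction is typically counted as correct when its temporal IoU with a ground-truth segment of the same class exceeds a chosen threshold.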

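Core idea: LiquidTAD replaces the self-attention layers of a Transformer with a liquid neural network (LNN) and uses the closed-form continuous-time (CfC) formulation to parallelize it. This keeps the LNN's ability to capture temporal dependencies while avoiding the sequential execution bottleneck of traditional LNNs, which lowers computational complexity.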
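The summary does not give the ActionLiquid equations. As a rough illustration of what a closed-form continuous-time gate looks like, the sketch below uses the general CfC form from the LNN literature, x(t) = σ(-f·t) ⊙ g + (1 - σ(-f·t)) ⊙ h, and evaluates it for all time steps at once. Computing f, g, and h from the input features alone (so that no step depends on the previous hidden state) is an assumption made here to show why such a form can be parallelized; the weight names `Wf`, `Wg`, `Wh` and the function `cfc_gate_parallel` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cfc_gate_parallel(x, t, Wf, Wg, Wh):
    """CfC-style gate applied to every time step in one batched call.

    x:  (N, D) per-step input features
    t:  (N,)   elapsed time associated with each step
    Wf, Wg, Wh: (D, H) weights of three hypothetical linear heads
    """
    f = x @ Wf                 # (N, H) controls how fast the gate decays with t
    g = np.tanh(x @ Wg)        # (N, H) branch that dominates when the gate is near 1
    h = np.tanh(x @ Wh)        # (N, H) branch that dominates when the gate is near 0
    gate = sigmoid(-f * t[:, None])
    # Time-dependent interpolation between the two branches; no loop over steps.
    return gate * g + (1.0 - gate) * h


rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
weights = [rng.normal(size=(4, 3)) for _ in range(3)]
states = cfc_gate_parallel(features, np.ones(6), *weights)  # shape (6, 3)
```

Because the expression has no explicit ODE solver and no step-to-step loop, the cost of this sketch grows linearly with the number of time steps N.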

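Technical framework: The overall LiquidTAD architecture consists of a feature extraction module (typically a pretrained CNN or Transformer) and a temporal modeling module built from multiple parallel ActionLiquid blocks. The ActionLiquid blocks take the output of the feature extraction module, model temporal structure with the parallelized LNN, and output predictions of action start and end times.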

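Key innovation: LiquidTAD parallelizes the LNN so that it can process temporal data with linear complexity. Traditional LNNs execute sequentially and are difficult to parallelize. LiquidTAD overcomes this limitation by using the CfC formulation to recast the LNN as a parallelizable operator.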
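As a generic illustration of the difference between sequential execution and a closed-form evaluation, the sketch below computes a simple linear recurrence h_t = a_t·h_{t-1} + b_t in two ways: with a step-by-step loop, and with a closed form built from cumulative products and sums. This is not the paper's operator, only an assumed toy example; the function names are hypothetical, and the closed form shown is numerically safe only when the cumulative product of `a` does not underflow.

```python
import numpy as np


def recurrence_loop(a, b, h0=0.0):
    """Sequential evaluation: each step waits for the previous state."""
    h = h0
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)


def recurrence_closed_form(a, b, h0=0.0):
    """Same recurrence without a Python loop over time.

    With A_t = a_1 * ... * a_t, dividing the recurrence by A_t gives
    h_t / A_t = h_{t-1} / A_{t-1} + b_t / A_t, hence
    h_t = A_t * (h0 + sum_{k<=t} b_k / A_k).
    """
    A = np.cumprod(a)
    return A * (h0 + np.cumsum(b / A))


rng = np.random.default_rng(0)
a = rng.uniform(0.5, 1.0, size=16)   # per-step decay factors (toy values)
b = rng.normal(size=16)              # per-step inputs (toy values)
same = np.allclose(recurrence_loop(a, b), recurrence_closed_form(a, b))
```

Cumulative products and sums can be computed with parallel scan primitives, which is the general reason a closed-form expression removes the step-by-step dependency.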

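Key design: The ActionLiquid block is the core component of LiquidTAD. Each ActionLiquid block contains multiple liquid neurons, which adaptively modulate temporal sensitivity through learned time constants (τ). The loss function typically includes a classification loss (for predicting the action class) and a regression loss (for predicting the action start and end times).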
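To give an intuition for how a time constant changes temporal sensitivity, the sketch below assumes the simplest leaky dynamics, dx/dt = -x/τ, under which an input seen Δt ago is weighted by exp(-Δt/τ). The actual ActionLiquid dynamics are not given in this summary, so this is an assumed toy model; `decay_weights` and the τ values are illustrative.

```python
import numpy as np


def decay_weights(tau, n_steps, dt=1.0):
    """Weight a leaky state assigns to an input seen k steps ago.

    Assumes dx/dt = -x / tau, so the weight at lag k*dt is exp(-k*dt / tau).
    """
    lags = np.arange(n_steps) * dt
    return np.exp(-lags / tau)


short_memory = decay_weights(tau=2.0, n_steps=20)   # forgets quickly
long_memory = decay_weights(tau=10.0, n_steps=20)   # retains older inputs
```

Under this toy model a small τ makes a unit respond mainly to recent frames, while a large τ lets it integrate over a longer window, which is one way learned time constants could adapt to short and long actions.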

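📊 Experimental Highlights

LiquidTAD achieves an average mAP of 69.46% on the THUMOS-14 dataset with only 10.82M parameters, 63% fewer than the ActionFormer baseline. Experiments on the ActivityNet-1.3 and Ego4D datasets also show that LiquidTAD achieves the best balance between accuracy and efficiency and is more robust to changes in temporal sampling.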
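The two reported figures imply an approximate size for the baseline. The arithmetic below is a back-of-the-envelope check that assumes the 63% reduction is relative to the baseline's parameter count; since the percentage is rounded, the result is only approximate, and the variable names are illustrative.

```python
liquidtad_params_m = 10.82   # reported LiquidTAD size, in millions of parameters
reduction = 0.63             # reported reduction relative to the baseline (rounded)

# If LiquidTAD keeps (1 - reduction) of the baseline's parameters,
# the baseline size follows by division.
implied_baseline_m = liquidtad_params_m / (1.0 - reduction)
print(round(implied_baseline_m, 2))  # 29.24
```

Under that assumption, the baseline would have roughly 29M parameters.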

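🎯 Application Scenarios

With its high parameter efficiency and low computational complexity, LiquidTAD suits resource-constrained settings such as mobile devices, embedded systems, and edge computing platforms. It can be applied to video surveillance, intelligent security, robot navigation, autonomous driving, and similar domains that need efficient and accurate action detection.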

📄 Abstract (Original)

Temporal Action Detection (TAD) in untrimmed videos is currently dominated by Transformer-based architectures. While high-performing, their quadratic computational complexity and substantial parameter redundancy limit deployment in resource-constrained environments. In this paper, we propose LiquidTAD, a novel parameter-efficient framework that replaces cumbersome self-attention layers with parallelized ActionLiquid blocks. Unlike traditional Liquid Neural Networks (LNNs) that suffer from sequential execution bottlenecks, LiquidTAD leverages a closed-form continuous-time (CfC) formulation, allowing the model to be reformulated as a parallelizable operator while preserving the intrinsic physical prior of continuous-time dynamics. This architecture captures complex temporal dependencies with $O(N)$ linear complexity and adaptively modulates temporal sensitivity through learned time-constants ($τ$), providing a robust mechanism for handling varying action durations. To the best of our knowledge, this work is the first to introduce a parallelized LNN-based architecture to the TAD domain. Experimental results on the THUMOS-14 dataset demonstrate that LiquidTAD achieves a highly competitive Average mAP of 69.46\% with only 10.82M parameters -- a 63\% reduction compared to the ActionFormer baseline. Further evaluations on ActivityNet-1.3 and Ego4D benchmarks confirm that LiquidTAD achieves an optimal accuracy-efficiency trade-off and exhibits superior robustness to temporal sampling variations, advancing the Pareto frontier of modern TAD frameworks.
