Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

Authors: Xucheng Shen, Kun Li, Fei Wang, Wei Qian, Jin Jiang, Dan Guo

Category: cs.CV

Published: 2026-06-05

Notes: Technical Report. 1st Place in Micro-gesture Online Recognition in 4th MiGA at IJCAI 2026

💡 One-sentence takeaway

Proposes a Spatial-Temporal Decoupled Adapter for micro-gesture online recognition.

🎯 Matched area: Pillar 8: Physics-based Animation

Keywords: micro-gesture recognition, spatial-temporal decoupling, deep learning, adaptive augmentation, computer vision

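📋 Key points

Existing parameter-efficient adapters typically model spatial and temporal cues jointly in a single branch, which may fail to capture the fine-grained patterns of micro-gestures.
The paper proposes a Spatial-Temporal Decoupled Adapter that strengthens micro-gesture recognition through independent temporal and spatial branches.
The method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.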

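📝 Abstract (translated)

Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Because micro-gestures are extremely short, low in motion amplitude, and visually ambiguous, capturing discriminative spatiotemporal representations is very challenging. Existing parameter-efficient adapters typically use a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, the paper proposes a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to handle the long-tail distribution of the benchmark dataset, the authors introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. The method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.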

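🔬 Method details

Problem definition: The paper addresses the difficulty of capturing discriminative spatiotemporal representations in micro-gesture online recognition. Micro-gestures are extremely short, low in motion amplitude, and visually ambiguous, and existing single-branch adapters may miss their fine-grained patterns.

Core idea: A Spatial-Temporal Decoupled Adapter decomposes video adaptation into independent temporal and spatial branches, each implemented with lightweight depthwise convolutions.

Technical framework: The adapter consists of two main branches, a temporal branch and a spatial branch, which process the temporal and spatial information in the video separately. Both branches are built from lightweight depthwise convolutions, and training is complemented by Adaptive Soft Balanced Augmentation.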
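
As a rough illustration of the two-branch design described above, the following NumPy sketch applies a depthwise 1-D convolution along time and a depthwise 2-D convolution over the spatial grid inside a bottleneck adapter. The bottleneck projections, the kernel size, the ReLU, the additive fusion of the two branches, the zero-initialized up-projection, and all names (`DecoupledAdapter`, `depthwise_temporal_conv`, `depthwise_spatial_conv`) are assumptions made for illustration. The summary does not specify these details, so this should not be read as the authors' implementation.

```python
import numpy as np


def depthwise_temporal_conv(x, w):
    """Depthwise 1-D convolution along time with zero padding.

    x: features of shape (T, H, W, C); w: kernels of shape (k, C), one per channel.
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0), (0, 0), (0, 0)))
    T = x.shape[0]
    # Each channel is filtered only with its own kernel (no channel mixing).
    return sum(xp[i:i + T] * w[i] for i in range(k))


def depthwise_spatial_conv(x, w):
    """Depthwise 2-D convolution over the H x W grid with zero padding.

    x: features of shape (T, H, W, C); w: kernels of shape (k, k, C).
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad), (0, 0)))
    H, W = x.shape[1:3]
    return sum(xp[:, i:i + H, j:j + W] * w[i, j]
               for i in range(k) for j in range(k))


class DecoupledAdapter:
    """Hypothetical bottleneck adapter with separate temporal and spatial branches."""

    def __init__(self, dim, bottleneck, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.w_t = rng.normal(0.0, 0.02, (k, bottleneck))      # temporal branch
        self.w_s = rng.normal(0.0, 0.02, (k, k, bottleneck))   # spatial branch
        # Assumption: zero-init so the adapter starts as an identity mapping.
        self.up = np.zeros((bottleneck, dim))

    def __call__(self, x):
        z = x @ self.down                                   # (T, H, W, bottleneck)
        z = depthwise_temporal_conv(z, self.w_t) + depthwise_spatial_conv(z, self.w_s)
        z = np.maximum(z, 0.0)                              # ReLU (assumed)
        return x + z @ self.up                              # residual connection


# Usage: 8 frames, a 14 x 14 token grid, 64 channels.
x = np.random.default_rng(1).normal(size=(8, 14, 14, 64))
y = DecoupledAdapter(dim=64, bottleneck=16)(x)
print(y.shape)
```

Because each branch is depthwise, its parameter count grows with the kernel size and the bottleneck width only, which is one way such an adapter can stay lightweight.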

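Key innovation: The central novelty is the spatial-temporal decoupling itself. The adapter models temporal and spatial cues in separate branches rather than jointly in one branch, as existing parameter-efficient adapters typically do, with the aim of better capturing fine-grained gesture patterns.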

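Key design: On the architecture side, lightweight depthwise convolutions keep the added computation small. On the data side, Adaptive Soft Balanced Augmentation dynamically allocates augmentation intensity according to class rarity and learning difficulty, which removes the need for manually set thresholds.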
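
To make the idea of threshold-free, per-class augmentation intensity concrete, here is a minimal Python sketch. The summary gives no formula, so everything below is a hypothetical stand-in: rarity is taken as a log-scaled inverse frequency, difficulty as one minus a running per-class accuracy, and the two are simply averaged. The function name `augmentation_strengths` and its parameters are likewise assumptions.

```python
import math


def augmentation_strengths(class_counts, class_accuracy, max_strength=1.0):
    """Map each class to a continuous augmentation strength in [0, max_strength].

    class_counts:   {class: number of training samples}, all counts >= 1
    class_accuracy: {class: running accuracy in [0, 1]} as a difficulty proxy
    Hypothetical formula: no class is switched on or off by a hard threshold;
    rarer and harder classes just receive proportionally stronger augmentation.
    """
    log_max = math.log(max(class_counts.values()))
    log_min = math.log(min(class_counts.values()))
    span = (log_max - log_min) or 1.0  # avoid division by zero if all counts match

    strengths = {}
    for cls, n in class_counts.items():
        rarity = (log_max - math.log(n)) / span   # 0 = most frequent, 1 = rarest
        difficulty = 1.0 - class_accuracy[cls]    # 0 = easy, 1 = hard
        strengths[cls] = max_strength * 0.5 * (rarity + difficulty)
    return strengths


# Usage with made-up class names and statistics.
counts = {"touch_face": 1000, "scratch_arm": 10}
accuracy = {"touch_face": 0.9, "scratch_arm": 0.3}
print(augmentation_strengths(counts, accuracy))
```

In this sketch the rare, poorly learned class ends up with a much higher strength than the frequent, well-learned one, and the strengths can be recomputed each epoch as the accuracy estimates change.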

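📊 Experimental highlights

The method achieves an F1 score of 0.43808 in Track 2 of the 4th EI-MiGA-IJCAI Challenge, ranking 1st, which supports the effectiveness of the Spatial-Temporal Decoupled Adapter for micro-gesture recognition.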

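🎯 Application scenarios

Potential applications include human-computer interaction, virtual reality, and augmented reality, where more reliable recognition of users' micro-gestures could improve the user experience. The technique might also extend to areas such as smart homes and medical monitoring.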

📄 Abstract (original)

Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.
