MELLM: A Flow-Guided Large Language Model for Micro-Expression Understanding

📄 arXiv: 2505.07007v3

Authors: Sirui Zhao, Zhengye Zhang, Shifeng Liu, Xinglong Mao, Shukang Yin, Chaoyou Fu, Tong Xu, Enhong Chen

Category: cs.CV

Published: 2025-05-11 (updated: 2025-12-09)


💡 One-Sentence Takeaway

Proposes MELLM, a flow-guided large language model for micro-expression understanding (MEU).

🎯 Matched Areas: Pillar 3: Spatial Perception & Semantics · Pillar 9: Embodied Foundation Models

Keywords: micro-expression understanding, optical flow estimation, large language models, affective computing, multimodal learning

📋 Key Points

  1. Existing micro-expression recognition methods focus mainly on discrete emotion classification and lack a comprehensive understanding of subtle facial dynamics and emotional cues.
  2. This paper proposes MELLM, which combines optical-flow sensitivity to subtle facial motion with the reasoning ability of large language models to improve micro-expression understanding.
  3. Experiments show that MEFlowNet surpasses existing methods in optical flow estimation, and MELLM achieves state-of-the-art performance on multiple micro-expression benchmarks.

📝 Abstract (Summary)

Micro-expressions (MEs) are brief, low-intensity facial movements that reveal concealed emotions and are crucial for affective computing. Despite notable progress in ME recognition, existing methods are largely confined to discrete emotion classification and lack comprehensive ME Understanding (MEU) capability, especially in interpreting subtle facial dynamics and emotional cues. To this end, this paper proposes a Micro-Expression Large Language Model (MELLM) that combines optical-flow-based sensitivity to subtle facial motion with the reasoning ability of large language models. An iterative optical-flow estimator named MEFlowNet is introduced to precisely capture facial micro-movements, and MEFlowDataset, a large-scale optical-flow dataset with 54,611 onset-apex image pairs, is constructed. Experiments show that MEFlowNet significantly outperforms existing methods in facial and ME flow estimation, while MELLM achieves state-of-the-art accuracy and generalization across multiple ME benchmarks.

🔬 Method Details

Problem definition: This work targets the inadequate interpretation of subtle facial dynamics in micro-expression understanding (MEU). Existing methods mostly focus on discrete emotion classification and cannot fully capture the complexity of micro-expressions.

Core idea: Propose MELLM, which integrates optical flow estimation with the reasoning ability of large language models to strengthen the model's understanding of subtle facial motion. MEFlowNet serves as the optical flow estimator, dedicated to capturing the subtle changes of micro-expressions.

Technical framework: The overall architecture comprises the MEFlowNet optical flow estimation module and the MELLM language model. MEFlowNet extracts optical flow signals of facial micro-movements, on top of which MELLM is instruction-tuned for micro-expression understanding; a sketch of this pipeline follows.
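
The flow-guided paradigm can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the `flow_net` call signature, the `flow_to_text` helper, and the prompt template are hypothetical stand-ins rather than the authors' released code, and the language-model side assumes a Hugging Face-style `generate`/`decode` interface.

```python
# Minimal sketch of the flow-guided MEU pipeline (all names hypothetical).
import torch

def flow_to_text(flow):
    """Crude textual summary of a dense flow field [1, 2, H, W]:
    mean motion magnitude and dominant direction (image y-axis points down)."""
    u, v = flow[:, 0], flow[:, 1]
    mag = torch.sqrt(u ** 2 + v ** 2).mean().item()
    dx, dy = u.mean().item(), v.mean().item()
    direction = ("up" if dy < 0 else "down") + "-" + ("left" if dx < 0 else "right")
    return f"mean magnitude {mag:.3f} px, dominant direction {direction}"

def describe_micro_expression(onset, apex, flow_net, llm, tokenizer):
    """onset/apex: [1, 3, H, W] aligned face frames; flow_net: a trained
    ME flow estimator; llm/tokenizer: an instruction-tuned language model."""
    with torch.no_grad():
        flow = flow_net(onset, apex)  # [1, 2, H, W] micro-motion field

    prompt = (
        "Facial motion between onset and apex frames: "
        f"{flow_to_text(flow)}.\n"
        "Describe the micro-expression and infer the underlying emotion."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = llm.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Per the original abstract, the same flow signals are also used offline to build MEU-Instruct, the instruction-tuning set on which MELLM is fine-tuned; the whole-face summary above deliberately oversimplifies that step.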

Key innovations: MEFlowNet is the first optical flow estimator dedicated to micro-expressions, and MELLM is the first large language model tailored for micro-expression understanding; their combination substantially improves MEU capability.

Key design: MEFlowNet adopts iterative, warping-based optical flow estimation to capture subtle facial motion with high precision. The loss design balances flow-estimation accuracy with the extraction of micro-expression features to ensure effective training; a rough sketch of the iterative scheme is given below.
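
As a rough illustration of what an iterative, warping-based estimator does, the sketch below follows the generic recurrent-refinement recipe (RAFT-style): backward-warp the apex frame with the current flow, predict a residual update from the remaining discrepancy, and supervise every iterate with a weighted loss. The `update_net` module and the loss weighting are assumptions; the actual MEFlowNet architecture and loss are not reproduced here.

```python
# Generic sketch of iterative, warping-based flow refinement with a
# weighted multi-iteration loss. `update_net` is a hypothetical module.
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img by flow via bilinear sampling.
    img: [B, C, H, W]; flow: [B, 2, H, W] in pixels (x, y)."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=img.device, dtype=img.dtype),
        torch.arange(W, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (W - 1) - 1.0, 2.0 * grid_y / (H - 1) - 1.0), dim=-1
    )
    return F.grid_sample(img, grid, align_corners=True)

def iterative_flow(onset, apex, update_net, iters=8):
    """Start from zero flow; each step warps apex toward onset with the
    current estimate and predicts a residual correction."""
    B, _, H, W = onset.shape
    flow = torch.zeros(B, 2, H, W, device=onset.device)
    predictions = []
    for _ in range(iters):
        warped = warp(apex, flow)
        delta = update_net(torch.cat([onset, warped, flow], dim=1))
        flow = flow + delta
        predictions.append(flow)
    return predictions

def sequence_loss(predictions, flow_gt, gamma=0.8):
    """RAFT-style exponentially weighted L1 loss over all iterates, so
    later (more refined) predictions dominate. The weighting is an
    assumption, not the paper's specification."""
    n = len(predictions)
    loss = 0.0
    for i, pred in enumerate(predictions):
        loss = loss + gamma ** (n - i - 1) * (pred - flow_gt).abs().mean()
    return loss
```

The exponential weighting lets early iterations receive gradient signal without dominating the loss, which is the usual rationale in recurrent flow estimators.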

📊 Experimental Highlights

Experiments show that MEFlowNet significantly outperforms existing methods in facial and micro-expression optical flow estimation, and MELLM achieves over 90% accuracy on multiple micro-expression benchmarks, demonstrating strong generalization and application potential.

🎯 Application Scenarios

This work has broad application potential in affective computing, psychology, and security surveillance. By improving micro-expression understanding, MELLM can be applied to emotion recognition, psychological state assessment, and human-computer interaction, and may drive the commercialization and adoption of related technologies.

📄 Abstract (Original)

Micro-expressions (MEs), brief and low-intensity facial movements revealing concealed emotions, are crucial for affective computing. Despite notable progress in ME recognition, existing methods are largely confined to discrete emotion classification, lacking the capacity for comprehensive ME Understanding (MEU), particularly in interpreting subtle facial dynamics and underlying emotional cues. While Multimodal Large Language Models (MLLMs) offer potential for MEU with their advanced reasoning abilities, they still struggle to perceive such subtle facial affective behaviors. To bridge this gap, we propose a ME Large Language Model (MELLM) that integrates optical flow-based sensitivity to subtle facial motions with the powerful inference ability of LLMs. Specifically, an iterative, warping-based optical-flow estimator, named MEFlowNet, is introduced to precisely capture facial micro-movements. For its training and evaluation, we construct MEFlowDataset, a large-scale optical-flow dataset with 54,611 onset-apex image pairs spanning diverse identities and subtle facial motions. Subsequently, we design a Flow-Guided Micro-Expression Understanding paradigm. Under this framework, the optical flow signals extracted by MEFlowNet are leveraged to build MEU-Instruct, an instruction-tuning dataset for MEU. MELLM is then fine-tuned on MEU-Instruct, enabling it to translate subtle motion patterns into human-readable descriptions and generate corresponding emotional inferences. Experiments demonstrate that MEFlowNet significantly outperforms existing optical flow methods in facial and ME-flow estimation, while MELLM achieves state-of-the-art accuracy and generalization across multiple ME benchmarks. To the best of our knowledge, this work presents two key contributions: MEFlowNet as the first dedicated ME flow estimator, and MELLM as the first LLM tailored for MEU.