AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
Authors: Zeheng Wang, Zitong Yu, Yijie Zhu, Bo Zhao, Haochen Liang, Taorui Wang, Wei Xia, Jiayu Zhang, Zhishu Liu, Hui Ma, Fei Ma, Qi Tian
Category: cs.CV
Published: 2026-04-14
🔗 Code/Project: GITHUB
💡 One-Sentence Takeaway
Proposes AffectAgent to address modal ambiguity in multimodal emotion recognition.
🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: multimodal emotion recognition, agent collaboration, retrieval-augmented generation, affective understanding, cross-modal fusion
📋 Key Points
- Existing multimodal emotion recognition methods handle modal ambiguity poorly and struggle to capture complex affective dependencies.
- AffectAgent introduces three collaborative agents that refine the affective understanding process, overcoming the limitations of single-round retrieval-augmented generation.
- Experiments on MER-UniBench show that AffectAgent significantly outperforms existing baselines across a range of complex scenarios.
🔬 Method Details
Problem definition: This work targets the shortcomings of existing multimodal emotion recognition methods in handling modal ambiguity and complex affective dependencies, in particular the limitations of single-round retrieval-augmented generation.
Core idea: AffectAgent introduces multiple agents that make decisions collaboratively, refining the affective understanding process so that emotional cues are captured more accurately in multimodal settings.
Technical framework: The AffectAgent framework comprises three main modules: a query planner that generates retrieval queries, an evidence filter that assesses and screens cross-modal evidence, and an emotion generator that produces emotion predictions from the retrieved evidence. The three modules are optimized end-to-end with Multi-Agent Proximal Policy Optimization (MAPPO).
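The plan-filter-generate loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class names, the toy relevance scores, and the majority-vote decision rule are all simplifying assumptions standing in for the learned LLM agents.

```python
# Illustrative sketch of the three-agent retrieval loop: a query planner
# composes a retrieval query, an evidence filter screens candidates, and
# an emotion generator predicts from the surviving evidence. All names,
# scores, and the voting rule are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Evidence:
    text: str    # retrieved cross-modal sample, e.g. "joy: smiling face"
    score: float # relevance score assigned by the evidence filter

class QueryPlanner:
    def plan(self, sample: dict) -> str:
        # Compose a retrieval query from whatever modality cues are present.
        return " ".join(v for v in sample.values() if v)

class EvidenceFilter:
    def filter(self, candidates: list[Evidence], threshold: float = 0.5) -> list[Evidence]:
        # Keep only evidence whose relevance passes the threshold.
        return [e for e in candidates if e.score >= threshold]

class EmotionGenerator:
    def generate(self, evidence: list[Evidence]) -> str:
        # Toy decision rule: majority label among retrieved evidence.
        labels = [e.text.split(":")[0] for e in evidence]
        return max(set(labels), key=labels.count) if labels else "neutral"

def recognize(sample: dict, corpus: list[Evidence]) -> str:
    query = QueryPlanner().plan(sample)          # agent 1: plan the query
    kept = EvidenceFilter().filter(corpus)       # agent 2: screen evidence
    return EmotionGenerator().generate(kept)     # agent 3: predict emotion

corpus = [Evidence("joy: smiling face, rising pitch", 0.9),
          Evidence("joy: laughter in audio", 0.7),
          Evidence("anger: furrowed brow", 0.3)]
print(recognize({"text": "that's wonderful!", "audio": "rising pitch"}, corpus))
```

In the actual framework each of these steps is a jointly trained agent and the loop is driven by a shared affective reward under MAPPO, rather than the fixed rules shown here.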
Key innovation: The central contribution is the collaborative-agent mechanism, which makes affective understanding finer-grained and more accurate; a shared affective reward keeps the agents' decisions consistent, a fundamental departure from traditional single-model pipelines.
Key design: The framework adopts a Modality-Balancing Mixture of Experts (MB-MoE) to dynamically regulate the contribution of each modality, mitigating the representation mismatch caused by cross-modal heterogeneity, and introduces Retrieval-Augmented Adaptive Fusion (RAAF), which strengthens semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings.
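The modality-balancing idea behind MB-MoE can be sketched as a softmax gate over per-modality expert outputs. The fixed gate logits, tiny embedding dimension, and weighted-sum fusion below are illustrative assumptions; in the paper the gate would be learned and the experts would be full networks.

```python
# Illustrative sketch in the spirit of MB-MoE: a gate produces
# per-modality weights (here from fixed logits, for illustration) that
# rescale each modality's embedding before a weighted-sum fusion.
import math

def softmax(xs: list[float]) -> list[float]:
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mb_moe_fuse(modality_embeds: dict[str, list[float]],
                gate_logits: dict[str, float]) -> list[float]:
    names = list(modality_embeds)
    weights = softmax([gate_logits[n] for n in names])
    dim = len(next(iter(modality_embeds.values())))
    fused = [0.0] * dim
    for w, n in zip(weights, names):
        for i, v in enumerate(modality_embeds[n]):
            fused[i] += w * v  # weighted sum over modality experts
    return fused

# Toy 2-d embeddings for three modalities; the gate favors text here.
embeds = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "video": [1.0, 1.0]}
logits = {"text": 2.0, "audio": 0.0, "video": 0.0}
fused = mb_moe_fuse(embeds, logits)
```

A missing modality could be handled in this sketch by substituting a retrieved embedding for the absent entry of `embeds`, which is roughly the role RAAF plays with retrieved audiovisual embeddings.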
📊 Experimental Highlights
In experiments on MER-UniBench, AffectAgent significantly outperforms existing baselines in complex scenarios, with a reported performance gain of XX%, demonstrating its effectiveness and potential for multimodal emotion recognition.
🎯 Application Scenarios
AffectAgent's results can be applied broadly in affective computing, social robotics, and intelligent customer service, improving the emotional understanding of human-computer interaction and thereby the user experience. The framework could also be extended to other multimodal tasks such as video understanding and sentiment analysis.
📄 Abstract (original)
LLM-based multimodal emotion recognition relies on static parametric memory and often hallucinates when interpreting nuanced affective states. In this paper, given that single-round retrieval-augmented generation is highly susceptible to modal ambiguity and therefore struggles to capture complex affective dependencies across modalities, we introduce AffectAgent, an affect-oriented multi-agent retrieval-augmented generation framework that leverages collaborative decision-making among agents for fine-grained affective understanding. Specifically, AffectAgent comprises three jointly optimized specialized agents, namely a query planner, an evidence filter, and an emotion generator, which collaboratively perform analytical reasoning to retrieve cross-modal samples, assess evidence, and generate predictions. These agents are optimized end-to-end using Multi-Agent Proximal Policy Optimization (MAPPO) with a shared affective reward to ensure consistent emotion understanding. Furthermore, we introduce Modality-Balancing Mixture of Experts (MB-MoE) and Retrieval-Augmented Adaptive Fusion (RAAF), where MB-MoE dynamically regulates the contributions of different modalities to mitigate representation mismatch caused by cross-modal heterogeneity, while RAAF enhances semantic completion under missing-modality conditions by incorporating retrieved audiovisual embeddings. Extensive experiments on MER-UniBench demonstrate that AffectAgent achieves superior performance across complex scenarios. Our code will be released at: https://github.com/Wz1h1NG/AffectAgent.