CPFD: Confidence-aware Privileged Feature Distillation for Short Video Classification
Authors: Jinghao Shi, Xiang Shen, Kaili Zhao, Xuedong Wang, Vera Wen, Zixuan Wang, Yifan Wu, Zhixin Zhang
Categories: cs.LG, cs.CV
Published: 2024-10-03 (updated: 2024-10-07)
Comments: Camera ready for CIKM 2024
💡 One-sentence takeaway
Proposes CPFD, a confidence-aware privileged feature distillation method that improves short video classification accuracy.
🎯 Matched domains: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models
Keywords: short video classification, privileged feature distillation, confidence-aware, multimodal learning, knowledge transfer
📋 Key points
- Existing methods struggle to combine the efficiency of end-to-end multimodal models with the information carried by historical privileged dense features.
- CPFD uses the teacher model's confidence to adaptively distill privileged features, improving the student model's performance.
- Experiments show that CPFD significantly improves the F1 score for short video classification, and the method has been deployed in production.
🔬 Method details
Problem definition: In short video classification, dense features customized for different business scenarios are essential. However, their complexity, scenario-specific adaptation requirements, and high computational cost make them resource-intensive and hard to access during online inference. Existing privileged feature distillation (PFD) methods apply uniform weights to all instances during distillation, which leads to unstable performance across business scenarios and a notable performance gap between teacher and student models.
Core idea: The core idea is to use the teacher model's confidence to guide the distillation of privileged features. By introducing a confidence-aware mechanism, CPFD adaptively adjusts the distillation weight of each instance, mitigating the student model's performance variance across scenarios and narrowing the gap to the teacher. This transfers the knowledge in privileged features to the end-to-end multimodal model more effectively.
Technical framework: The CPFD framework consists of a teacher model (Dense Feature enhanced multimodal model, DF-X-VLM) and a student model (multimodal-only X-VLM). The teacher is trained with dense features, while the student uses only multimodal features. During training, the teacher produces predictions and confidence scores; the confidence scores adjust the per-instance weight of privileged feature distillation, guiding the student's learning. The resulting student approaches the teacher's performance without relying on dense features at inference time.
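In this framework, the teacher's confidence can be derived from its output distribution. The sketch below uses the maximum softmax probability as the per-instance confidence score; this particular choice, and the function names, are illustrative assumptions rather than the paper's exact definition:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_confidence(teacher_logits):
    # Per-instance confidence: the teacher's maximum softmax
    # probability (one common choice; the exact definition used
    # by CPFD is not specified in this summary).
    return softmax(teacher_logits).max(axis=-1)

# A confidently classified instance gets a weight near 1;
# an ambiguous one gets a smaller weight.
logits = np.array([[4.0, 0.5, 0.1],   # teacher is confident
                   [1.0, 0.9, 0.8]])  # teacher is uncertain
weights = teacher_confidence(logits)
```

These per-instance weights are what replace the uniform weighting of vanilla PFD.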
Key innovation: The key innovation is the confidence-aware mechanism for adaptively weighting privileged feature distillation. Unlike vanilla PFD, CPFD treats instances differently according to the teacher's confidence, which more effectively reduces the student's performance variance across scenarios. This adaptive distillation strategy is CPFD's core advantage.
Key design: The key design choices are: 1) using the teacher model's prediction confidence as the basis for distillation weights; 2) designing a loss function that balances prediction accuracy against knowledge distillation; 3) potentially adjusting how the confidence score is computed per business scenario for better performance. Concrete network architecture details depend on the underlying multimodal model (e.g., X-VLM).
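The design points above can be sketched as a loss that adds a confidence-weighted distillation term to the standard classification loss. This is a minimal NumPy sketch under stated assumptions, not the paper's published equations: the balance factor `alpha`, the max-probability confidence, and the KL-based distillation term are all assumed forms:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cpfd_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Sketch of a confidence-weighted distillation objective:
    cross-entropy on the ground-truth labels plus a per-instance
    KL(teacher || student) term weighted by the teacher's maximum
    softmax probability. `alpha` and this exact form are
    assumptions for illustration."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    n = len(labels)
    eps = 1e-12
    # Supervised cross-entropy on the hard labels.
    ce = -np.log(s[np.arange(n), labels] + eps).mean()
    # Teacher confidence as the per-instance distillation weight.
    w = t.max(axis=-1)
    # Per-instance KL divergence from teacher to student.
    kl = (t * (np.log(t + eps) - np.log(s + eps))).sum(axis=-1)
    return ce + alpha * (w * kl).mean()
```

Only the student runs at inference time, so this distillation term, like the dense features themselves, exists only during training.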
📊 Experimental highlights
Offline experiments on five diverse short video classification tasks show that CPFD improves the F1 score by 6.76% on average over the end-to-end multimodal model (X-VLM) and by 2.31% over vanilla PFD. CPFD also reduces the teacher-student performance gap by 84.6%, achieving results comparable to the teacher model DF-X-VLM. Online experiments further confirm its effectiveness, and the framework has been deployed in production systems.
🎯 Application scenarios
CPFD applies broadly to scenarios that need efficient short video classification, such as video recommendation, content moderation, and ad targeting. By lowering the computational cost of online inference, it improves system responsiveness and resource utilization, yielding a better user experience and higher business value. The approach also offers a template for other tasks that require knowledge transfer or model compression.
📄 Abstract (original)
Dense features, customized for different business scenarios, are essential in short video classification. However, their complexity, specific adaptation requirements, and high computational costs make them resource-intensive and less accessible during online inference. Consequently, these dense features are categorized as "Privileged Dense Features". Meanwhile, end-to-end multi-modal models have shown promising results in numerous computer vision tasks. In industrial applications, prioritizing end-to-end multi-modal features can enhance efficiency but often leads to the loss of valuable information from historical privileged dense features. To integrate both features while maintaining efficiency and manageable resource costs, we present Confidence-aware Privileged Feature Distillation (CPFD), which empowers features of an end-to-end multi-modal model by adaptively distilling privileged features during training. Unlike existing privileged feature distillation (PFD) methods, which apply uniform weights to all instances during distillation, potentially causing unstable performance across different business scenarios and a notable performance gap between teacher model (Dense Feature enhanced multimodal-model DF-X-VLM) and student model (multimodal-model only X-VLM), our CPFD leverages confidence scores derived from the teacher model to adaptively mitigate the performance variance with the student model. We conducted extensive offline experiments on five diverse tasks demonstrating that CPFD improves the video classification F1 score by 6.76% compared with end-to-end multimodal-model (X-VLM) and by 2.31% with vanilla PFD on average. And it reduces the performance gap by 84.6% and achieves results comparable to teacher model DF-X-VLM. The effectiveness of CPFD is further substantiated by online experiments, and our framework has been deployed in production systems for over a dozen models.