Dataset Distillation with Neural Characteristic Function: A Minmax Perspective
Authors: Shaobo Wang, Yicun Yang, Zhiyuan Liu, Chenghao Sun, Xuming Hu, Conghui He, Linfeng Zhang
Categories: cs.CV, cs.AI, cs.LG
Published: 2025-02-28
Comments: Accepted by CVPR 2025, 11 pages, 7 figures
Venue: Conference on Computer Vision and Pattern Recognition, 2025
💡 One-Sentence Takeaway
Proposes a neural characteristic function to address the distribution matching problem in dataset distillation.
🎯 Matched Area: Pillar 2: RL Algorithms & Architecture (RL & Architecture)
Keywords: dataset distillation, distribution matching, neural networks, characteristic functions, minmax optimization, deep learning, performance improvement
📋 Key Points
- Existing distribution matching methods often fail to accurately capture distributional differences, yielding unreliable discrepancy measures.
- This paper reformulates dataset distillation as a minmax optimization problem and introduces Neural Characteristic Function Discrepancy (NCFD) as the distance metric.
- Experiments show a 20.5% accuracy boost on ImageSquawk, together with drastically lower GPU memory usage and much faster processing.
📝 Abstract (Translated)
Dataset distillation is an effective way to reduce the data requirements of deep learning, but existing distribution matching methods fall short when measuring distributional differences. This paper reformulates dataset distillation as a minmax optimization problem and introduces Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric of distributional difference. NCFD leverages the characteristic function (CF) to encapsulate full distributional information, employing a neural network to optimize the sampling strategy for the CF's frequency arguments, thereby maximizing the discrepancy to sharpen distance estimation. The resulting neural characteristic function matching method inherently aligns the phase and amplitude of features of real and synthetic data, and experiments show that it delivers significant performance gains on both low- and high-resolution datasets.
🔬 Method Details
Problem definition: Existing dataset distillation methods measure distributional discrepancy inaccurately during distribution matching, which degrades the quality of the synthetic data.
Core idea: Reformulate dataset distillation as a minmax optimization problem and introduce Neural Characteristic Function Discrepancy (NCFD) to capture full distributional information and guide the generation of synthetic data.
Technical framework: The method alternates between two stages: first, a neural network optimizes the frequency arguments of the characteristic function to maximize the measured distributional discrepancy; second, the gap between real and synthetic data is minimized under this optimized NCFD metric.
Key innovation: NCFD is a new metric that comprehensively reflects distributional differences, providing a more reliable measurement foundation than traditional metrics.
Key design: A dedicated loss balances the phase and amplitude of real and synthetic features, and the frequency sampling strategy is optimized for efficiency. Specific parameter settings and network architecture details are described in the paper.
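As a toy illustration of this alternating minmax loop (not the authors' implementation: the real method uses a neural frequency sampler and gradient-based updates, whereas the sketch below uses random search on 1-D data), one can write:

```python
import numpy as np

rng = np.random.default_rng(0)

def ecf(x, freqs):
    """Empirical characteristic function: phi(t) = E[exp(i * t * x)] per frequency t."""
    return np.exp(1j * np.outer(freqs, x)).mean(axis=1)

def cf_discrepancy(real, syn, freqs):
    """Mean squared gap between the two empirical CFs (a crude stand-in for NCFD)."""
    return np.mean(np.abs(ecf(real, freqs) - ecf(syn, freqs)) ** 2)

real = rng.normal(2.0, 1.0, size=500)  # "real" dataset: N(2, 1)
syn = rng.normal(0.0, 1.0, size=50)    # small synthetic set, initialized at N(0, 1)
freqs = rng.normal(size=16)            # sampled frequency arguments t of the CF

for step in range(200):
    # Max step (stand-in for the neural frequency sampler): keep a fresh
    # frequency batch only if it exposes a larger real-vs-synthetic gap.
    cand = rng.normal(size=16)
    if cf_discrepancy(real, syn, cand) > cf_discrepancy(real, syn, freqs):
        freqs = cand
    # Min step: perturb the synthetic set and accept if the gap shrinks.
    prop = syn + 0.05 * rng.normal(size=syn.shape)
    if cf_discrepancy(real, prop, freqs) < cf_discrepancy(real, syn, freqs):
        syn = prop
```

The adversarial frequency search keeps the discrepancy informative while the synthetic set is pulled toward the real distribution; the paper replaces both random-search steps with learned, gradient-based updates on neural features.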
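A minimal sketch of a phase/amplitude-balanced CF loss follows; the weighting `alpha` and the exact decomposition are illustrative assumptions, not the paper's loss:

```python
import numpy as np

def ecf(x, freqs):
    """Empirical characteristic function E[exp(i * t * x)] at each frequency t."""
    return np.exp(1j * np.outer(freqs, x)).mean(axis=1)

def phase_amp_loss(real, syn, freqs, alpha=0.5):
    """Penalize amplitude and phase gaps of the two empirical CFs separately.

    Amplitude |phi(t)| relates to sample diversity and phase to realism;
    `alpha` trades the two off (an assumed form, not the paper's objective).
    """
    cr, cs = ecf(real, freqs), ecf(syn, freqs)
    amp_gap = (np.abs(cr) - np.abs(cs)) ** 2
    # Phase gap via the chord distance between the unit-modulus phase factors.
    eps = 1e-12
    phase_gap = np.abs(cr / (np.abs(cr) + eps) - cs / (np.abs(cs) + eps)) ** 2
    return np.mean(alpha * amp_gap + (1 - alpha) * phase_gap)
```

A pure shift of the synthetic data leaves the CF amplitudes unchanged but rotates the phases, so this loss responds to it only through the phase term.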
📊 Experimental Highlights
Experiments show that the method achieves a 20.5% accuracy improvement on ImageSquawk, reduces GPU memory usage by more than 300×, and runs 20× faster than state-of-the-art methods, demonstrating a substantial performance advantage.
🎯 Applications
Potential application areas include computer vision, natural language processing, and other deep learning domains, especially data-scarce settings, where the method can improve training efficiency and model performance. Going forward, it may enable more efficient data processing and model training pipelines.
📄 Abstract (Original)
Dataset distillation has emerged as a powerful approach for reducing data requirements in deep learning. Among various methods, distribution matching-based approaches stand out for their balance of computational efficiency and strong performance. However, existing distance metrics used in distribution matching often fail to accurately capture distributional differences, leading to unreliable measures of discrepancy. In this paper, we reformulate dataset distillation as a minmax optimization problem and introduce Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric for measuring distributional differences. NCFD leverages the Characteristic Function (CF) to encapsulate full distributional information, employing a neural network to optimize the sampling strategy for the CF's frequency arguments, thereby maximizing the discrepancy to enhance distance estimation. Simultaneously, we minimize the difference between real and synthetic data under this optimized NCFD measure. Our approach, termed Neural Characteristic Function Matching (NCFM), inherently aligns the phase and amplitude of neural features in the complex plane for both real and synthetic data, achieving a balance between realism and diversity in synthetic samples. Experiments demonstrate that our method achieves significant performance gains over state-of-the-art methods on both low- and high-resolution datasets. Notably, we achieve a 20.5% accuracy boost on ImageSquawk. Our method also reduces GPU memory usage by over 300× and achieves 20× faster processing speeds compared to state-of-the-art methods. To the best of our knowledge, this is the first work to achieve lossless compression of CIFAR-100 on a single NVIDIA 2080 Ti GPU using only 2.3 GB of memory.