ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation

Authors: Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, Vishnu Suresh Lokhande

Categories: cs.CV, cs.LG

Published: 2026-02-26

Comments: CVPR 2026

💡 One-Sentence Summary

Proposes ManifoldGD, a training-free, manifold-guided method for diffusion-based dataset distillation.

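🎯 Matched area: Pillar 2: RL Algorithms and Architecture (RL & Architecture)

Keywords: dataset distillation, diffusion models, manifold learning, training-free learning, generative models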

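📋 Key Points

Existing dataset distillation methods rely on simple mode guidance and ignore the geometric structure of the data manifold, which limits the quality of the synthesized dataset.
ManifoldGD adds manifold guidance to the diffusion model's denoising process, constraining generation to stay on the data manifold and improving the representativeness and diversity of the distilled dataset.
Experiments show that ManifoldGD outperforms existing methods on FID, l2 distance, and classification accuracy, supporting its effectiveness for training-free dataset distillation.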

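📝 Abstract (Translated from Chinese)

This paper proposes ManifoldGD, a training-free diffusion-based dataset distillation framework that integrates manifold-consistent guidance at every denoising timestep. The method computes instance prototype centroids (IPCs) through hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. A local neighborhood of the extracted IPC centroids is used to build the latent manifold for each diffusion denoising timestep. At each denoising step, the mode-alignment vector is projected onto the local tangent space of the estimated latent manifold, which constrains the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without any model retraining. Experiments show consistent gains over existing training-free and training-based baselines in FID, l2 distance between real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.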

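🔬 Method Details

Problem definition: Dataset distillation aims to synthesize a compact dataset that preserves the knowledge of a large-scale training set while drastically reducing storage and computation. Existing training-free, diffusion-based distillation methods typically use simple mode guidance, for example pulling samples directly toward instance prototype centroids (IPCs). This ignores the underlying manifold structure of the data, so the synthesized dataset lacks representativeness and diversity.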

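Core idea: ManifoldGD introduces manifold-consistent guidance into the denoising process of the diffusion model. It estimates the manifold structure of the data in latent space and, at every denoising step, constrains generation to that manifold, so that generated samples stay semantically consistent while better representing the distribution of the original dataset.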
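
The projection at the heart of this idea can be written compactly. The block below is a sketch under stated assumptions: the symbols (z_t for the current latent, c for the IPC centroid used for guidance, U_t for an orthonormal basis of the estimated tangent space) are illustrative choices and are not taken from the paper.

```latex
% Assumed notation: z_t \in \mathbb{R}^d is the latent at denoising step t,
% c \in \mathbb{R}^d is an IPC centroid, and U_t \in \mathbb{R}^{d \times k}
% has orthonormal columns spanning the estimated local tangent space.
g_t = c - z_t, \qquad
g_t^{\parallel} = U_t U_t^{\top} g_t, \qquad
g_t^{\perp} = g_t - g_t^{\parallel}
```

Under this reading, plain mode guidance moves along the full vector g_t, while manifold guidance keeps only the tangential component and discards the part that points off the estimated manifold.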

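Technical framework: ManifoldGD proceeds in five main stages: 1) encode the original dataset into latent space with a pre-trained VAE; 2) compute instance prototype centroids (IPCs) in latent space with hierarchical divisive clustering, building a multi-scale IPC coreset; 3) estimate the latent manifold from the local neighborhood of the IPCs; 4) during diffusion denoising, project the mode-alignment vector onto the local tangent space of the estimated latent manifold to apply manifold guidance; 5) decode the generated latents back to image space to obtain the synthetic dataset.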
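
Stage 2 can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that "hierarchical divisive clustering" can be approximated by repeatedly bisecting each cluster with 2-means, and the names `two_means`, `divisive_centroids`, `depth`, and `min_size` are hypothetical. The input `latents` stands in for VAE latent features flattened to shape (n, d).

```python
import numpy as np


def two_means(points, rng, n_iter=20):
    """Split `points` (n, d) into two groups with plain 2-means (Lloyd's algorithm)."""
    # Initialise the two centres from two distinct random points.
    idx = rng.choice(len(points), size=2, replace=False)
    centres = points[idx].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign every point to its nearest centre.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its members (skip empty groups).
        for k in range(2):
            if np.any(labels == k):
                centres[k] = points[labels == k].mean(axis=0)
    return labels


def divisive_centroids(latents, depth, min_size=2, seed=0):
    """Return one centroid array per hierarchy level (coarse to fine).

    Assumption: each level bisects every cluster of the previous level.
    """
    rng = np.random.default_rng(seed)
    clusters = [latents]
    levels = [np.stack([latents.mean(axis=0)])]  # level 0: a single coarse centroid
    for _ in range(depth):
        next_clusters = []
        for pts in clusters:
            if len(pts) < 2 * min_size:
                next_clusters.append(pts)  # too small to split further
                continue
            labels = two_means(pts, rng)
            left, right = pts[labels == 0], pts[labels == 1]
            if len(left) == 0 or len(right) == 0:
                next_clusters.append(pts)  # degenerate split, keep the cluster whole
            else:
                next_clusters.extend([left, right])
        clusters = next_clusters
        levels.append(np.stack([c.mean(axis=0) for c in clusters]))
    return levels
```

Calling `divisive_centroids(latents, depth=2)` returns three arrays of centroids: the coarse levels play the role of semantic modes, and the deeper levels capture finer intra-class variation. How the paper selects which levels enter the coreset is not stated in this summary.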

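Key innovation: The central contribution of ManifoldGD is manifold guidance, which brings the geometric structure of the data manifold into dataset distillation. Rather than simply steering generation toward the IPCs as existing methods do, ManifoldGD constrains generation to the data manifold and therefore better preserves the structure of the original dataset. It is presented as the first geometry-aware training-free data distillation framework.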

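Key design: The main design choices in ManifoldGD are: 1) hierarchical divisive clustering to build a multi-scale IPC coreset that captures semantic information at different scales; 2) estimation of the latent manifold from the local neighborhood of the IPCs, with manifold guidance implemented as a projection onto the local tangent space; 3) manifold guidance applied at every denoising step of the diffusion model, which keeps the generation process manifold-consistent. As for specific parameter settings, the VAE model and the clustering parameters need to be chosen according to the characteristics of the dataset.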
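
The tangent-space projection in design choice 2 can be sketched as follows. This is a minimal illustration, assuming the local tangent space is estimated by local PCA (an SVD of the neighbours centred on the centroid); the summary does not say which estimator the paper uses. The function names and the `dim` parameter are hypothetical.

```python
import numpy as np


def tangent_basis(centroid, neighbours, dim):
    """Estimate an orthonormal basis (d, dim) of the local tangent space by local PCA.

    `neighbours` has shape (n, d) with n >= dim; `dim` is an assumed tangent dimension.
    """
    # Centre the neighbourhood on the centroid, then keep the leading right-singular vectors.
    diffs = neighbours - centroid
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:dim].T


def project_guidance(z, centroid, basis):
    """Project the mode-alignment vector (centroid - z) onto the span of `basis`."""
    g = centroid - z
    return basis @ (basis.T @ g)
```

As a sanity check, if the neighbours all lie in the plane where the third coordinate is zero and the centroid is the origin, then for z = (1, 2, 3) the mode-alignment vector (-1, -2, -3) is projected to (-1, -2, 0): the component pointing off the plane is removed. How the projected vector is scaled and added to the denoising update is not specified in this summary.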

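📊 Experimental Highlights

Experimental results show that ManifoldGD achieves significant performance improvements on both CIFAR-10 and ImageNet. For example, on CIFAR-10, ManifoldGD outperforms existing training-free methods on FID and reaches classification accuracy comparable to some training-based methods. On ImageNet, ManifoldGD also performs well, demonstrating its effectiveness on large-scale datasets.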

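🎯 Application Scenarios

ManifoldGD applies to settings that require dataset compression, such as model deployment on mobile devices, machine learning in resource-constrained environments, and rapid prototyping with large-scale datasets. By generating high-quality synthetic datasets, ManifoldGD can lower storage and computation costs, speed up model training and deployment, and help bring machine learning to a wider range of domains.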

📄 Abstract (Original)

In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
