LSHBloom: Memory-efficient, Extreme-scale Document Deduplication

Authors: Arham Khan, Robert Underwood, Carlo Siebenschuh, Yadu Babuji, Aswathy Ajith, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Kyle Chard, Ian Foster

Categories: cs.LG

Published: 2024-11-06 (updated: 2025-12-02)

💡 One-Sentence Summary

LSHBloom: a memory-efficient, scalable method for document deduplication

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: document deduplication, MinhashLSH, Bloom filter, large-scale datasets, language model training

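📋 Key Points

Large language model training relies on massive text corpora, but these datasets contain many duplicate documents, which hurts training efficiency and model quality.
LSHBloom replaces the LSHIndex in MinhashLSH with Bloom filters, reducing memory footprint and improving runtime.
Experiments show that LSHBloom preserves deduplication accuracy while substantially reducing disk space and runtime, making it suitable for large-scale datasets.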

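🔬 Method Details

Problem definition: The paper addresses document deduplication for internet-scale text datasets. Existing methods such as MinhashLSH deduplicate well, but their memory footprint is too high to scale to billions of documents. Other heuristic methods are efficient but not accurate enough.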

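Core idea: The paper replaces the memory-hungry LSHIndex in MinhashLSH with Bloom filters. A Bloom filter is a highly space-efficient data structure for testing whether an element belongs to a set; it can return false positives at a tunable rate but never false negatives. Accepting a very small loss in precision buys a large reduction in memory footprint.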
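To make the trade-off concrete, here is a minimal Bloom filter sketch in Python. It is an illustration only, not the authors' implementation: the class name `BloomFilter`, the use of `hashlib.blake2b`, and the double-hashing scheme for deriving bit positions are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus k derived bit positions per item."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, item: bytes):
        # Double hashing (an illustrative choice): derive k positions
        # from the two 64-bit halves of a single 128-bit digest.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter(num_bits=1 << 16, num_hashes=4)
bf.add(b"doc-1")
print(b"doc-1" in bf)
```

The filter stores only bits, never the inserted keys, which is where the space saving over a bucket-based index comes from. The cost is that a lookup can report an item as present when it was never inserted.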

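Technical framework: The LSHBloom pipeline is as follows: 1. Compute a MinHash signature (a fingerprint vector) for each document. 2. Split the signature into bands, as in MinhashLSH, and insert a hash of each band into the Bloom filter for that band. 3. For each new document, compute the signature in the same way and check each band against its Bloom filter. If any band is already present, the document is treated as a duplicate.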
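The pipeline can be sketched end to end in a few dozen lines. This is a simplified illustration under stated assumptions, not the paper's code: it assumes the standard MinhashLSH banding scheme (one Bloom filter per band, a document flagged when any band has been seen before), word 3-gram shingles, and a policy of inserting only documents that were not flagged. All names (`LSHBloomSketch`, `check_and_insert`, `hash64`, `shingles`) and default parameters are hypothetical.

```python
import hashlib
import random

PRIME = (1 << 61) - 1   # modulus for the (a*h + b) mod p hash family
MASK32 = (1 << 32) - 1  # keep MinHash values to 32 bits


def hash64(data: bytes) -> int:
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of a document (n=3 is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class LSHBloomSketch:
    """One Bloom filter per LSH band, standing in for the bucket index."""

    def __init__(self, num_perm=128, bands=32, bits_per_filter=1 << 20, num_hashes=5, seed=1):
        if num_perm % bands:
            raise ValueError("num_perm must be divisible by bands")
        rng = random.Random(seed)
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_perm)]
        self.rows = num_perm // bands
        self.filters = [BloomFilter(bits_per_filter, num_hashes) for _ in range(bands)]

    def signature(self, text: str) -> list[int]:
        # Step 1: MinHash signature, one minimum per hash function.
        hashed = [hash64(s.encode("utf-8")) for s in shingles(text)]
        return [min(((a * h + b) % PRIME) & MASK32 for h in hashed) for a, b in self.params]

    def check_and_insert(self, text: str) -> bool:
        """Return True if the document looks like a duplicate; otherwise insert it."""
        sig = self.signature(text)
        # Step 2: serialize each band of the signature into one key.
        keys = [",".join(map(str, sig[i * self.rows:(i + 1) * self.rows])).encode()
                for i in range(len(self.filters))]
        # Step 3: any band already present in its filter flags a duplicate.
        if any(key in bloom for key, bloom in zip(keys, self.filters)):
            return True
        for key, bloom in zip(keys, self.filters):
            bloom.add(key)
        return False


dedup = LSHBloomSketch(num_perm=64, bands=16)
doc = "the quick brown fox jumps over the lazy dog near the quiet river bank"
print(dedup.check_and_insert(doc), dedup.check_and_insert(doc))
```

Because the filters hold only bits, the sketch can answer "has a similar document been seen?" but cannot return which earlier document matched. That is sufficient for streaming deduplication, where the only decision is whether to keep or drop the incoming document.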

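Key innovation: The key innovation is using Bloom filters in place of the LSHIndex. The LSHIndex has to store a large number of hash buckets, whereas a Bloom filter only stores a bit array, which greatly reduces memory footprint. The paper also tunes the Bloom filter parameters to balance accuracy against memory footprint.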

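Key design: The key design choices in LSHBloom are: 1. The number of MinHash hash functions, which must be tuned to the dataset size and the similarity threshold. 2. The size of the Bloom filter bit array, which determines the false positive rate and must be set from the dataset size and the acceptable false positive rate. 3. The choice of hash functions, which should be well distributed to reduce collisions.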
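These parameters can be estimated with textbook formulas. The sketch below uses the standard Bloom filter sizing rule, m = -n ln p / (ln 2)^2 bits with k = (m/n) ln 2 hash functions, and the standard LSH banding curve, 1 - (1 - s^r)^b. Splitting an overall false positive budget evenly across the per-band filters is an assumption made here for illustration; the function names are hypothetical and this is not taken from the paper's code.

```python
import math


def bloom_size(n_items: int, fp_rate: float) -> tuple[int, int]:
    """Bits (m) and hash count (k) for a Bloom filter holding n_items at fp_rate."""
    m = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


def per_filter_fp(overall_fp: float, bands: int) -> float:
    """Per-filter rate p such that 1 - (1 - p)^bands equals overall_fp.

    Assumes the filters err independently (an illustrative simplification).
    """
    return -math.expm1(math.log1p(-overall_fp) / bands)


def lsh_candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents with the given Jaccard similarity share a band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


n_docs = 1_000_000
bands, rows = 32, 4
p = per_filter_fp(1e-5, bands)
m, k = bloom_size(n_docs, p)
print(f"per-filter fp={p:.3e}, bits per filter={m}, hashes={k}")
print(f"total size ~ {bands * m / 8 / 2**20:.1f} MiB for {n_docs} documents")
print(f"P(candidate | s=0.8) = {lsh_candidate_probability(0.8, bands, rows):.4f}")
```

The bit array grows linearly with the number of documents and only logarithmically with the inverse of the false positive rate, so tightening the error budget is comparatively cheap.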

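📊 Experimental Highlights

On the peS2o dataset, LSHBloom runs 12 times faster than MinhashLSH and uses 18 times less disk space, with only a marginal increase in false positives (near zero in the authors' experiments). Extrapolation shows that these space and runtime advantages persist even at the scale of several billion documents. This indicates that LSHBloom can substantially improve the efficiency of large-scale document deduplication while preserving deduplication accuracy.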

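🎯 Application Scenarios

LSHBloom can be used to clean training data for large language models by removing duplicate documents, improving training efficiency and model quality. The method may also apply to web page deduplication, copyright protection, and information retrieval. By lowering memory footprint, LSHBloom makes high-quality document deduplication practical at very large scale.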

📄 Abstract (Original)

Contemporary large language model (LLM) training pipelines require the assembly of internet-scale databases full of text data from a variety of sources (e.g., web, academic, and publishers). Preprocessing these datasets via deduplication -- detecting and eliminating additional instances of the same content -- is a major focus for assembling and curating training datasets for LLMs. Unrestrained, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Unfortunately, contemporary approaches to document-level deduplication are either unreliable at accurately identifying duplicate documents or extremely expensive in terms of both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same state-of-the-art deduplication performance as MinhashLSH, with only a marginal increase in false positives (near zero in our experiments), while boasting competitive runtime (12$\times$ faster than MinhashLSH on peS2o) and, crucially, using 18$\times$ less disk space than MinhashLSH (as measured on peS2o). Based on extrapolation, we show that this advantage in space and runtime remains even at the extreme scale of several billion documents. LSHBloom allows practitioners to access the deduplication quality of MinHashLSH at scales that are normally only tractable for less sophisticated, heuristic solutions. As a result, LSHBloom promises to enable scaling high-quality document deduplication to internet-scale text datasets.
