Regurgitative Training: The Value of Real Data in Training Large Language Models

Authors: Jinghui Zhang, Dandan Qiao, Mochen Yang, Qiang Wei

Categories: cs.CL, cs.AI, stat.ML

Published: 2024-07-03 (updated: 2024-07-25)

💡 One-sentence takeaway

The study finds that retraining an LLM on LLM-generated data significantly degrades its performance.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, regurgitative training, data quality, machine translation, data augmentation

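📋 Key points

Challenge for LLM training: a growing share of web data is generated by LLMs, and training directly on it may harm model performance.
Core idea: measure experimentally how retraining on LLM-generated data affects model performance, and investigate the root causes of the decline.
Results: retraining on LLM-generated data significantly degrades performance, and the paper proposes three strategies to mitigate the loss.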

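📝 Abstract (translated from the Chinese)

The explosive success of large language models (LLMs) means that a large amount of online content will be generated by LLMs rather than humans, and this content will inevitably enter the training datasets of next-generation LLMs. This paper evaluates the effect of such "regurgitative training" on LLM performance. By fine-tuning GPT-3.5 on data generated by itself or by other LLMs for a machine translation task, we find strong evidence that regurgitative training clearly handicaps LLM performance. The same performance loss is observed on Transformer models trained from scratch. We find suggestive evidence that the disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data compared with real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. The results show that real, human-generated data has substantial value in training LLMs and cannot be easily replaced by synthetic, LLM-generated data.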

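🔬 Method details

Problem definition: The paper studies how training a new LLM on LLM-generated data ("regurgitated data") affects model performance. Current practice uses all available online data directly and overlooks that much of it may be LLM-generated. This can degrade model performance, particularly when the generated data has quality and diversity problems.

Core idea: Compare experimentally the performance of LLMs trained on real data with that of LLMs trained on LLM-generated data, and analyze what causes the difference. Then, targeting those causes, propose strategies that mitigate the performance loss from training on LLM-generated data.

Technical framework: The overall framework has the following stages: 1) Use GPT-3.5 or other LLMs to generate data for a machine translation task. 2) Use the generated data and the real data, separately, to fine-tune GPT-3.5 or to train Transformer models from scratch. 3) Evaluate the fine-tuned or trained models on the machine translation task. 4) Analyze how LLM-generated data and real data differ in error rate and lexical diversity. 5) Propose and evaluate strategies to mitigate the performance loss: ordered training based on data quality, mixing data generated by different LLMs, and ordered training based on AI detection.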
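Stage 4 compares the lexical diversity of LLM-generated and real data. As a minimal sketch of what such a comparison could look like, the snippet below computes a type-token ratio (distinct tokens divided by total tokens). This metric, the regex tokenizer, and the toy sentences are assumptions chosen for illustration; the summary does not say which diversity measure the authors used. Type-token ratio also depends on text length, so the two corpora should be of similar size when compared.

```python
import re


def type_token_ratio(sentences):
    """Return distinct tokens divided by total tokens over a list of sentences."""
    tokens = []
    for sentence in sentences:
        # Lowercase and keep simple word tokens only (a simplifying assumption).
        tokens.extend(re.findall(r"[a-z']+", sentence.lower()))
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy, made-up sentences for illustration only (not data from the paper).
varied = [
    "The cat dozed on the warm windowsill.",
    "A storm rattled the old shutters.",
]
repetitive = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",
]

print(f"varied:     {type_token_ratio(varied):.3f}")
print(f"repetitive: {type_token_ratio(repetitive):.3f}")
```

A lower ratio for one corpus than for another of comparable size would be consistent with the lower lexical diversity that the paper reports for LLM-generated data.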

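Key contribution: The paper systematically studies the effect of "regurgitative training" on LLM performance and proposes strategies to mitigate the resulting loss. Earlier work paid less attention to the potential negative effect of LLM-generated data on the training of later models.

Key design: The main design choices are: 1) Machine translation serves as the benchmark task for evaluating LLM performance. 2) Data-driven metrics gauge the quality of each LLM-generated instance and determine the order in which instances are added during training. 3) Data generated by several different LLMs is mixed in an attempt to increase lexical diversity. 4) An AI detection classifier distinguishes LLM-generated from human-generated data, and LLM-generated instances are added to training in order of their resemblance to human-generated data.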
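Both ordered-training designs share one pattern: score every generated instance, then add instances to the training set from best to worst. The sketch below shows only that ordering step. The function name `ordered_training_subsets`, the `step` parameter, and the example scores are hypothetical; in practice the scores might come from a quality metric or from a detector's estimate of how human-like an instance is, and the paper's actual scoring and schedule are not described in this summary.

```python
def ordered_training_subsets(examples, scores, step):
    """Yield growing training sets, adding the highest-scored examples first.

    `scores` holds one number per example, where higher means "add earlier".
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    # Sort by score only, so ties keep their original relative order.
    ranked = [
        example
        for _, example in sorted(
            zip(scores, examples), key=lambda pair: pair[0], reverse=True
        )
    ]
    # Each yielded subset extends the previous one by up to `step` examples.
    for end in range(step, len(ranked) + step, step):
        yield ranked[:end]


# Hypothetical instances and scores, for illustration only.
examples = ["a", "b", "c", "d", "e"]
scores = [0.1, 0.9, 0.5, 0.7, 0.3]

for subset in ordered_training_subsets(examples, scores, step=2):
    print(subset)
```

Training and evaluating a model on each successive subset would show how much lower-scored data can be added before performance stops improving.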

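📊 Experimental highlights

The experiments show that retraining on LLM-generated data significantly degrades model performance. For example, on the machine translation task, GPT-3.5 fine-tuned on LLM-generated data scores significantly lower on BLEU than the same model fine-tuned on real data. The three proposed mitigation strategies improve performance to some extent but cannot always fully close the gap to training with real data. This underscores the importance of high-quality, human-generated data in LLM training.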
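For readers unfamiliar with BLEU, the sketch below shows the core of the metric: clipped n-gram precision combined with a brevity penalty. It is a simplified sentence-level version with one reference, whitespace tokenization, n-grams up to length 2, and no smoothing, all of which are assumptions made to keep the example short. Standard BLEU is usually computed at corpus level with n-grams up to length 4, and this is not the evaluation script used in the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference, sentence-level BLEU without smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


# Made-up sentences for illustration only.
reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))
print(simple_bleu("the cat sat down", reference))
```

An exact match scores 1.0, and the shorter, partly matching candidate scores lower because of both imperfect n-gram precision and the brevity penalty.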

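🎯 Application scenarios

The findings can inform the selection and cleaning of LLM training data, helping developers make better use of online data and avoid degrading model performance with low-quality LLM-generated content. The study also suggests directions for building future training datasets, such as designing better generation strategies that produce high-quality synthetic data closer to real data.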

📄 Abstract (original)

What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. We evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps the performance of LLMs. The same performance loss of regurgitative training is observed on transformer models that we train from scratch. We find suggestive evidence that the performance disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data as compared to real data. Based on these mechanisms, we propose and evaluate three different strategies to mitigate the performance loss of regurgitative training. First, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, and then carry out an ordered training process where high-quality data are added before low-quality ones. Second, we combine data generated by multiple different LLMs (as an attempt to increase lexical diversity). Third, we train an AI detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in the order of resemblance to human-generated data. All three strategies can improve the performance of regurgitative training to some extent but are not always able to fully close the gap from training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot be easily substituted by synthetic, LLM-generated data.
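The second strategy in the abstract combines data generated by multiple LLMs. One simple way to build such a mixture is sketched below: sample the same number of examples from each source, then shuffle. The function name `mix_sources`, the equal-share sampling, and the source names are hypothetical; the abstract does not specify the mixing proportions the authors used.

```python
import random


def mix_sources(sources, per_source, seed=0):
    """Sample up to `per_source` examples from each source, then shuffle them.

    `sources` maps a source name to its list of examples. Each returned item
    is a (source_name, example) pair so the origin stays traceable.
    """
    rng = random.Random(seed)  # Fixed seed keeps the mixture reproducible.
    mixed = []
    for name, examples in sources.items():
        k = min(per_source, len(examples))
        mixed.extend((name, example) for example in rng.sample(examples, k))
    rng.shuffle(mixed)
    return mixed


# Hypothetical model names and placeholder examples, for illustration only.
sources = {
    "model_a": ["a1", "a2", "a3"],
    "model_b": ["b1", "b2"],
}
for name, example in mix_sources(sources, per_source=2):
    print(name, example)
```

Keeping the source name with each example makes it possible to check afterwards whether the mixture is actually more lexically diverse than any single source.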
