DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining
Authors: Vinayak Arannil, Neha Narwal, Sourav Sanjukta Bhabesh, Sai Nikhil Thirandas, Darren Yow-Bang Wang, Graham Horwood, Alex Anto Chirayath, Gouri Pandeshwar
Categories: cs.CL, cs.AI, cs.LG
Published: 2024-09-30 (updated: 2024-10-09)
💡 One-line takeaway
Proposes DoPAMine to address the shortage of pre-training data in low-resource industry domains
🎯 Matched domain: Pillar 9: Embodied Foundation Models
Keywords: domain adaptation, data mining, large language models, healthcare AI, finance AI, pre-training, seed data generation
📋 Key points
- Applying large language models in specialized domains is limited by data scarcity and a lack of truthfulness, especially in low-resource industries such as healthcare and finance.
- The proposed DoPAMine framework uses seed-guided data mining to automatically extract domain-specific, real-world training data from a large corpus, addressing the data-acquisition bottleneck.
- Experiments show that DoPAMine improves performance by 4.9% (zero-shot) and 5.1% (5-shot) on healthcare tasks, and by 2.9% (zero-shot) and 6.7% (5-shot) on finance tasks, validating its effectiveness.
📝 Abstract (translated)
Large language models (LLMs) have demonstrated strong generalization across many industry domains, but their performance is limited in specialized or low-resource industries. Existing approaches typically rely on generated domain-specific synthetic data, which often lacks truthfulness and complexity. To address this, the paper proposes an automated and scalable framework, DoPAMine, which uses seed-guided data mining to extract domain-specific training data from a large corpus for language-model domain adaptation. Experiments show that DoPAMine significantly improves model performance when used for continual pre-training in the healthcare and finance domains.
🔬 Method details
Problem definition: The paper targets the performance gap of large language models in specialized domains (e.g., healthcare and finance) caused by a lack of real-world training data. Existing methods rely on synthetic data that often lacks truthfulness and complexity, falling short of practical requirements.
Core idea: The DoPAMine framework leverages the parametric knowledge of a large language model to generate diverse seed data, and then mines real-world, domain-specific data from a large corpus. This approach aims to improve the truthfulness and representativeness of the data to support domain adaptation.
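The seed-generation step can be sketched as a prompt-construction helper. The prompt wording, function name, and parameters below are illustrative assumptions for how one might elicit diverse seed documents from an LLM's parametric knowledge; they are not taken from the paper:

```python
def build_seed_prompt(domain, n_docs=5, topics=None):
    """Build an illustrative prompt asking an LLM to produce diverse seed
    documents for a target domain (prompt wording is an assumption)."""
    topic_hint = f" covering topics such as {', '.join(topics)}" if topics else ""
    return (
        f"Generate {n_docs} short, realistic documents from the {domain} domain"
        f"{topic_hint}. Vary the style (reports, Q&A, articles) so the set is "
        f"diverse and representative of real-world {domain} text."
    )

# Example: seed prompt for the healthcare domain
prompt = build_seed_prompt("healthcare", n_docs=3,
                           topics=["clinical notes", "drug interactions"])
```

The returned string would then be sent to an LLM; the generated documents serve as seeds for the mining stage.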
Technical framework: DoPAMine's architecture has two main stages: first, an LLM generates seed data tailored to the target domain; second, real-world data related to those seeds is mined from a large corpus such as Common Crawl.
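A minimal sketch of the mining stage, assuming similarity-based retrieval against the seed set (the paper's actual retrieval model and thresholds are not specified here). The toy bag-of-words embedding stands in for a real sentence encoder:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words embedding; a real pipeline would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mine_domain_docs(seed_docs, corpus, top_k=2):
    """Rank corpus documents by maximum similarity to any seed document
    and keep the top_k -- an assumed, simplified form of seed-guided mining."""
    seed_vecs = [embed(s) for s in seed_docs]
    scored = [(max(cosine(embed(doc), sv) for sv in seed_vecs), doc)
              for doc in corpus]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

seeds = ["patient diagnosis and clinical treatment records"]
corpus = [
    "clinical notes describe patient diagnosis and treatment plans",
    "quarterly earnings beat analyst revenue forecasts",
    "the football match ended in a draw",
]
mined = mine_domain_docs(seeds, corpus, top_k=1)
# The healthcare document ranks first, since it shares the most seed vocabulary.
```

At Common Crawl scale, the exhaustive scoring above would be replaced by approximate nearest-neighbor search over precomputed embeddings.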
Key innovation: DoPAMine's novelty lies in its automation and scalability: it extracts domain-specific data from large corpora instead of relying on manual annotation or synthetic data. Compared with existing domain-adaptation techniques, this markedly improves the truthfulness and complexity of the resulting data.
Key design: In its implementation, DoPAMine uses specific generation settings and training objectives so that the generated seed data effectively represents the target domain. The framework's design also balances mining efficiency against accuracy to improve overall performance.
📊 Experiment highlights
Results show that DoPAMine improves performance over the baseline by 4.9% (zero-shot) and 5.1% (5-shot) on healthcare tasks, and by 2.9% (zero-shot) and 6.7% (5-shot) on finance tasks. These results validate DoPAMine's effectiveness for domain adaptation.
🎯 Application scenarios
The DoPAMine framework has broad application potential, particularly in healthcare and finance, where it can supply high-quality training data for low-resource industries and thereby improve model performance. Its successful application could advance AI adoption in these industries and support the construction of more intelligent decision-support systems.
📄 Abstract (original)
Large Language Models (LLMs) have shown remarkable ability to generalize effectively across numerous industry domains while executing a range of tasks. Many of these competencies are obtained from the data utilized during the pre-training phase of the Language Models (LMs). However, these models exhibit limitations when tasked with performing in specialized or low-resource industry domains. More recent approaches use LLMs for generating domain-specific synthetic data but most often they lack in truthfulness and complexity. Alternatively, in cases where domain data is available like healthcare and finance most of the LMs are proprietary necessitating the need for a scalable method to curate real world industry specific pre-training data. In this work, we propose an automated and scalable framework - DoPAMine:Domain-specific Pre-training Adaptation from seed-guided data Mining, to mine domain specific training data from a large data corpus for domain adaptation of a LM. The framework leverages the parametric knowledge of a LLM to generate diverse and representative seed data tailored to a specific domain which is then used to mine real world data from a large data corpus like Common Crawl. We evaluated our framework's performance in the continual pre-training (CPT) setting by training two domain specific 7B parameter LMs in healthcare and finance with data mined via DoPAMine. Our experiments show that DoPAMine boosts the performance of pre-trained LLMs on average by 4.9% and 5.1% in zero-shot and 5-shot settings respectively on healthcare tasks from MMLU, MedQA, MedMCQA and PubMedQA datasets, and 2.9% and 6.7% for zero-shot and 5-shot settings respectively on finance tasks from FiQA-SA, FPB and Headlines datasets when compared to the baseline.