The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation

Authors: Olivier Gouvert, Julie Hunter, Jérôme Louradour, Christophe Cerisara, Evan Dufraisse, Yaya Sy, Laura Rivière, Jean-Pierre Lorré, OpenLLM-France community

Categories: cs.CL, cs.AI

Published: 2025-03-15

💡 One-Sentence Summary

Releases the Lucie-7B multilingual LLM and its training dataset, aiming to counter the English-centric bias of existing models.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: multilingual models, French, dataset construction, data copyright, open resources

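📋 Key Points

Pretraining datasets for existing large language models are commonly English-centric and neglect the characteristics of other languages and cultures.
The paper builds Lucie, a French-centered multilingual dataset, and trains the Lucie-7B model on it to better represent the culture of French-speaking communities.
The Lucie models achieve promising results relative to state-of-the-art models while prioritizing data rights.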

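📝 Abstract (Translated from Chinese)

This paper introduces the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a French-centered multilingual text corpus designed to offset the English-centric bias common in pretraining datasets for large language models. Its French data comes not only from traditional web sources but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, documents were added to support several other European languages, including English, Spanish, German, and Italian. Besides serving as a resource for French language and culture, an important feature of the dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, following the philosophy of past open projects, it is redistributed in the form used for training, and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of French and English data (roughly 33% each) in an effort to better represent cultural aspects of French-speaking communities. Two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, are also described and released as demonstrations of Lucie-7B in use. These models achieve promising results compared with state-of-the-art models, showing that an open approach that prioritizes data rights can still deliver strong performance. The authors regard them as an initial step toward more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while the model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI-compliant language models under the new OSI definition.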

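🔬 Method Details

Problem definition: Training datasets for existing large language models (LLMs) tend to be English-centric, so the resulting models do not fully capture the cultural context and nuances of other languages such as French. In addition, many datasets contain copyrighted material, which limits their openness and accessibility.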

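Core idea: Build a French-centered multilingual dataset and train a foundation model on it, reducing English-centric bias and better supporting the culture of French-speaking communities. Dataset construction prioritizes data rights by minimizing the use of copyrighted material, which keeps the dataset open and accessible.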
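The idea of minimizing copyrighted material can be illustrated with a minimal sketch. The abstract does not describe how documents were actually selected, so everything here is an assumption made for illustration: the `Document` record, its `license` field, and the `ALLOWED_LICENSES` allow-list are hypothetical names, not part of the Lucie pipeline.

```python
from dataclasses import dataclass

# Hypothetical allow-list: the source does not enumerate accepted licenses.
ALLOWED_LICENSES = {"public-domain", "cc0", "cc-by", "cc-by-sa"}


@dataclass
class Document:
    text: str
    lang: str
    license: str  # hypothetical metadata field, assumed for this sketch


def keep_rights_cleared(docs, allowed=ALLOWED_LICENSES):
    """Return only the documents whose license tag is in the allow-list."""
    return [doc for doc in docs if doc.license.lower() in allowed]


if __name__ == "__main__":
    corpus = [
        Document("Texte patrimonial ancien.", "fr", "public-domain"),
        Document("A recent news article.", "en", "all-rights-reserved"),
        Document("Ein offenes Dokument.", "de", "CC-BY"),
    ]
    for doc in keep_rights_cleared(corpus):
        print(doc.lang, doc.license)
```

An allow-list is used rather than a block-list so that documents with unknown or missing license tags are excluded by default, which matches the spirit of prioritizing data rights.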

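Technical framework: The overall framework has two main parts, dataset construction and model training. In the dataset construction stage, text is collected from web sources and French cultural heritage documents, covering French, English, Spanish, German, and Italian, among other languages, with French making up the largest share. In the model training stage, the Lucie-7B foundation model is trained on the collected dataset and then instruction fine-tuned to produce two models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data.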

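Key innovation: The main contribution is a French-centered multilingual dataset that also prioritizes data rights. It includes French cultural heritage documents alongside traditional web sources, filling a gap in existing datasets, and it minimizes copyrighted material so that it can be redistributed openly.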

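Key design: Lucie-7B is trained on equal amounts of French and English data (roughly 33% each) in order to better represent the culture of French-speaking communities. The dataset's processing is described on Hugging Face and GitHub, which supports reproducibility and extension. According to this summary, the instruction fine-tuned models use standard techniques, with the emphasis on high-quality instruction data to improve performance.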
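The language balance described above can be sketched as weighted sampling over language buckets. Only the roughly 33% shares for French and English come from the abstract; the `other` bucket that absorbs the remaining share, the `MIXTURE` dictionary, and both function names are assumptions introduced for illustration, not the authors' training code.

```python
import random
from collections import Counter

# The 33% figures for French and English come from the abstract.
# "other" is a placeholder for the remaining data (an assumption:
# the abstract gives no breakdown of that remainder).
MIXTURE = {"fr": 0.33, "en": 0.33, "other": 0.34}


def sample_languages(n, mixture=MIXTURE, seed=0):
    """Draw n language labels according to the mixture weights."""
    rng = random.Random(seed)
    langs = list(mixture)
    weights = [mixture[lang] for lang in langs]
    return rng.choices(langs, weights=weights, k=n)


def empirical_shares(labels):
    """Compute the observed fraction of each language label."""
    counts = Counter(labels)
    total = len(labels)
    return {lang: count / total for lang, count in counts.items()}


if __name__ == "__main__":
    shares = empirical_shares(sample_languages(100_000))
    for lang, share in sorted(shares.items()):
        print(f"{lang}: {share:.3f}")
```

With enough draws the observed shares settle near the configured weights, which is the property a data loader would rely on to keep French and English in balance over a training run.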

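📊 Experimental Highlights

After instruction fine-tuning, the Lucie-7B models achieve promising results compared with state-of-the-art models, indicating that strong language models can still be trained while prioritizing data rights. The abstract gives no specific performance figures; those must be looked up in the body of the paper.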

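🎯 Application Scenarios

The results can be applied to multilingual natural language processing, machine translation, cultural heritage preservation, and related areas. Lucie-7B can serve as a foundation model for French-language applications such as French text generation and French dialogue systems. The dataset can serve as a research resource for studying multilingual natural language processing and cultural differences.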

📄 Abstract (Original)

We present both the Lucie Training Dataset and the Lucie-7B foundation model. The Lucie Training Dataset is a multilingual collection of textual corpora centered around French and designed to offset anglo-centric biases found in many datasets for large language model pretraining. Its French data is pulled not only from traditional web sources, but also from French cultural heritage documents, filling an important gap in modern datasets. Beyond French, which makes up the largest share of the data, we added documents to support several other European languages, including English, Spanish, German, and Italian. Apart from its value as a resource for French language and culture, an important feature of this dataset is that it prioritizes data rights by minimizing copyrighted material. In addition, building on the philosophy of past open projects, it is redistributed in the form used for training and its processing is described on Hugging Face and GitHub. The Lucie-7B foundation model is trained on equal amounts of data in French and English -- roughly 33% each -- in an effort to better represent cultural aspects of French-speaking communities. We also describe two instruction fine-tuned models, Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, which we release as demonstrations of Lucie-7B in use. These models achieve promising results compared to state-of-the-art models, demonstrating that an open approach prioritizing data rights can still deliver strong performance. We see these models as an initial step toward developing more performant, aligned models in the near future. Model weights for Lucie-7B and the Lucie instruct models, along with intermediate checkpoints for the former, are published on Hugging Face, while model training and data preparation code is available on GitHub. This makes Lucie-7B one of the first OSI compliant language models according to the new OSI definition.
