Best Practices for Distilling Large Language Models into BERT for Web Search Ranking

Authors: Dezhi Ye, Junwei Hu, Jiabin Fan, Bowen Tian, Jie Liu, Haijin Liang, Jin Ma

Categories: cs.IR, cs.CL

Published: 2024-11-07

Comments: arXiv version

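💡 One-Sentence Summary

Proposes distillation techniques that transfer the ranking ability of large language models into BERT to improve web search ranking.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, BERT, search ranking, distillation, information retrieval, ranking loss, resource optimization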

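📋 Key Points

Applying large language models to search ranking is limited by their high computational cost, which makes direct use in commercial systems difficult.
The paper distills the ranking ability of large language models into a smaller BERT model, training with a ranking loss.
The method performs well in both offline and online evaluations and has been integrated into a commercial search engine, improving search efficiency.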

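📝 Abstract (Translated from Chinese)

Recent work shows that large language models (LLMs) have significant potential as zero-shot relevance rankers, but their high cost limits direct use in commercial search systems. This paper therefore explores techniques for transferring the ranking ability of LLMs to a more compact BERT model, using a ranking loss so that a less resource-intensive model can be deployed. The LLM is first strengthened through continued pre-training, with the query as input and the clicked title and summary as output, and is then fine-tuned under supervision with a ranking loss. A hybrid point-wise and margin MSE loss is introduced to transfer the ranking knowledge effectively. Offline and online evaluations confirm the method's effectiveness, and the model has been integrated into a commercial search engine as of February 2024.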

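🔬 Method Details

Problem definition: The paper addresses the fact that large language models cannot be applied directly in commercial search systems because of their high cost; existing methods perform poorly in resource-constrained settings.

Core idea: Distill the ranking knowledge of a large language model into a BERT model, training with a ranking loss so that a resource-efficient model can be deployed.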

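Technical framework: Training of the LLM has two main stages, continued pre-training and supervised fine-tuning. In the first stage, the model takes the query as input and generates the clicked title and summary as output. In the second stage, the model is fine-tuned with a ranking loss. The fine-tuned LLM then serves as the teacher whose ranking knowledge is distilled into BERT.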
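The continued pre-training stage can be pictured as turning click logs into input/output text pairs. The following is a minimal sketch of that data preparation, not the authors' code: the `ClickRecord` type, the `build_cpt_example` helper, and the `Query:`/`Title:`/`Summary:` template are hypothetical names chosen for illustration. The source only states that the query is the input and the clicked title and summary are the output.

```python
from dataclasses import dataclass


@dataclass
class ClickRecord:
    """One clicked search result (hypothetical schema)."""
    query: str
    title: str
    summary: str


def build_cpt_example(record: ClickRecord) -> tuple[str, str]:
    """Return (input_text, output_text) for continued pre-training.

    The prompt template is an assumption; the source does not specify one.
    """
    input_text = f"Query: {record.query}\n"
    output_text = f"Title: {record.title}\nSummary: {record.summary}"
    return input_text, output_text


record = ClickRecord(
    query="best hiking boots",
    title="Top 10 Hiking Boots",
    summary="A comparison of waterproof boots for long trails.",
)
input_text, output_text = build_cpt_example(record)
print(input_text + output_text)
```

In a setup like this, the language-modeling loss would commonly be computed only on the `output_text` tokens; whether the authors mask the query tokens is not stated in the source.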

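Key innovation: A hybrid point-wise and margin MSE loss transfers the ranking knowledge of the large language model to the smaller model, improving both the efficiency and the effectiveness of the deployed ranker.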
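To make the hybrid loss concrete, here is a small pure-Python sketch under stated assumptions. The point-wise term is taken to be the mean squared error between student and teacher scores, and the margin term the mean squared error between student and teacher score differences over all document pairs for one query. The `alpha` weighting and the all-pairs enumeration are assumptions for illustration; the source does not give the exact formula or weights.

```python
from itertools import combinations


def pointwise_mse(student: list[float], teacher: list[float]) -> float:
    """Mean squared error between student and teacher scores."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def margin_mse(student: list[float], teacher: list[float]) -> float:
    """Mean squared error between score margins over all document pairs."""
    pairs = list(combinations(range(len(student)), 2))
    if not pairs:  # fewer than two documents: no margin to compare
        return 0.0
    total = 0.0
    for i, j in pairs:
        student_margin = student[i] - student[j]
        teacher_margin = teacher[i] - teacher[j]
        total += (student_margin - teacher_margin) ** 2
    return total / len(pairs)


def hybrid_loss(student: list[float], teacher: list[float], alpha: float = 0.5) -> float:
    """Weighted sum of the two terms; alpha is a hypothetical mixing weight."""
    return alpha * pointwise_mse(student, teacher) + (1 - alpha) * margin_mse(student, teacher)


# Teacher (LLM) and student (BERT) scores for three documents of one query.
teacher_scores = [1.0, 0.5, 0.0]
student_scores = [1.5, 1.0, 0.5]  # same ordering and margins, shifted by 0.5
print(hybrid_loss(student_scores, teacher_scores))
```

In this example the student preserves every margin, so only the point-wise term is non-zero. That illustrates why the two terms are complementary: the margin term rewards correct relative ordering, while the point-wise term keeps the absolute scores calibrated to the teacher.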

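Key design: During supervised fine-tuning, the final token is treated as the representation of the whole sentence, because in an autoregressive language model only the final token can encapsulate all preceding tokens. The distillation loss combines point-wise and margin terms to optimize ranking ability. Specific hyperparameters and architecture details were validated and tuned in the experiments.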
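The final-token representation can be sketched as follows. This is an illustrative pure-Python version that assumes right-padded batches and a 0/1 attention mask; the function name `last_token_pool` and the toy data are hypothetical, and a real implementation would operate on framework tensors rather than nested lists.

```python
def last_token_pool(
    hidden_states: list[list[list[float]]],
    attention_mask: list[list[int]],
) -> list[list[float]]:
    """Pick the hidden state of the last non-padding token in each sequence.

    hidden_states: [batch][seq_len][hidden_dim]
    attention_mask: [batch][seq_len], 1 for real tokens and 0 for padding
    Assumes right padding and at least one real token per sequence.
    """
    pooled = []
    for sequence, mask in zip(hidden_states, attention_mask):
        last_index = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(sequence[last_index])
    return pooled


# Two sequences of length 3 with hidden size 2; the second has one padding slot.
hidden = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [0.0, 0.0]],
]
mask = [
    [1, 1, 1],
    [1, 1, 0],
]
pooled = last_token_pool(hidden, mask)
print(pooled)
```

The pooled vector would then typically be fed to a small scoring head to produce the relevance score used by the ranking loss; the form of that head is not described in the source.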

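📊 Experimental Highlights

The proposed method improves ranking accuracy over baseline models in offline evaluation. In online evaluation, it was successfully integrated into a commercial search engine, confirming its effectiveness and feasibility in practice.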

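🎯 Application Scenarios

Potential applications include commercial search engines, information retrieval systems, and other settings that need efficient text ranking. Transferring the strengths of large language models into a resource-efficient model makes a high-quality search experience achievable where compute is limited, which gives the approach clear practical value.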

📄 Abstract (Original)

Recent studies have highlighted the significant potential of Large Language Models (LLMs) as zero-shot relevance rankers. These methods predominantly utilize prompt learning to assess the relevance between queries and documents by generating a ranked list of potential documents. Despite their promise, the substantial costs associated with LLMs pose a significant challenge for their direct implementation in commercial search systems. To overcome this barrier and fully exploit the capabilities of LLMs for text ranking, we explore techniques to transfer the ranking expertise of LLMs to a more compact model similar to BERT, using a ranking loss to enable the deployment of less resource-intensive models. Specifically, we enhance the training of LLMs through Continued Pre-Training, taking the query as input and the clicked title and summary as output. We then proceed with supervised fine-tuning of the LLM using a rank loss, assigning the final token as a representative of the entire sentence. Given the inherent characteristics of autoregressive language models, only the final token can encapsulate all preceding tokens. Additionally, we introduce a hybrid point-wise and margin MSE loss to transfer the ranking knowledge from LLMs to smaller models like BERT. This method creates a viable solution for environments with strict resource constraints. Both offline and online evaluations have confirmed the efficacy of our approach, and our model has been successfully integrated into a commercial web search engine as of February 2024.
