Deconstructing In-Context Learning: Understanding Prompts via Corruption

📄 arXiv: 2404.02054v2

Authors: Namrata Shivagunde, Vladislav Lialin, Sherin Muckatira, Anna Rumshisky

Categories: cs.CL

Published: 2024-04-02 (updated: 2024-05-29)

Notes: Accepted to LREC-COLING 2024 main conference. The code is available at https://github.com/text-machine-lab/Understanding_prompts_via_corruption


💡 One-sentence takeaway

Uses corruption analysis of prompt components to understand how prompts work in in-context learning.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, in-context learning, prompt design, model evaluation, semantic corruption, AI assistants, robustness

📋 Key points

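  1. Pre-trained large language models are brittle under minor prompt modifications, and few-shot evaluation, the common way to assess their quality, is highly sensitive both to such modifications and to the choice of in-context examples.
  2. The paper decomposes the prompt into four components (task description, demonstration inputs, labels, and inline instructions) and studies how structural and semantic corruptions of each affect model performance.
  3. Repeating text within the prompt boosts performance, bigger models (≥30B) are more sensitive to the semantics of the prompt, and adding task and inline instructions helps even when the instructions are semantically corrupted.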

📝 Abstract (summary)

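The ability of large language models (LLMs) to "learn in context" from the provided prompt has driven the widespread adoption of AI assistants such as ChatGPT. The underlying pre-trained LLMs, however, are brittle under minor prompt modifications. This paper decomposes the prompt into four components (task description, demonstration inputs, labels, and inline instructions) and studies how structural and semantic corruptions of these elements affect model performance. It finds that repeating text within the prompt boosts performance and that bigger models (≥30B) are more sensitive to the semantics of the prompt. In addition, adding task and inline instructions to the demonstrations enhances performance even when the instructions are semantically corrupted.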

🔬 Method details

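Problem definition: The paper addresses the brittleness of pre-trained LLMs under minor prompt modifications. Few-shot evaluation, the usual way to assess these models, is strongly affected by the choice of prompt and of in-context examples.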

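Core idea: Decompose the prompt into four parts (task description, demonstration inputs, labels, and inline instructions) and study how structural and semantic corruptions of each part affect model performance, giving a more systematic account of what the model relies on.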
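To make the decomposition concrete, here is a minimal sketch of assembling a few-shot prompt from the four components. The `Demonstration` class, the `build_prompt` function, and the `Input:`/`Answer:` layout are hypothetical names and formatting chosen for illustration; they are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demonstration:
    text: str   # demonstration input
    label: str  # demonstration label


def build_prompt(
    task_description: Optional[str],
    inline_instruction: Optional[str],
    demonstrations: List[Demonstration],
    test_input: str,
) -> str:
    """Assemble a few-shot prompt from the four components.

    Passing None for task_description or inline_instruction drops that
    component. The layout ("Input:"/"Answer:" markers, blank lines
    between examples) is an assumption made for this sketch.
    """
    parts: List[str] = []
    if task_description:
        parts.append(task_description)
    for demo in demonstrations:
        block = [f"Input: {demo.text}"]
        if inline_instruction:
            # The inline instruction is repeated for every demonstration.
            block.append(inline_instruction)
        block.append(f"Answer: {demo.label}")
        parts.append("\n".join(block))
    # The test example follows the same layout but leaves the label open.
    query = [f"Input: {test_input}"]
    if inline_instruction:
        query.append(inline_instruction)
    query.append("Answer:")
    parts.append("\n".join(query))
    return "\n\n".join(parts)
```

Calling `build_prompt(None, None, demos, x)` yields a prompt with only demonstration inputs and labels, which makes it easy to compare variants with and without each component.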

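Technical framework: The study covers models ranging from 1.5B to 70B parameters and evaluates them on ten datasets spanning classification and generation tasks, analyzing the effect of each prompt element.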

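Key innovation: The main contribution is decomposing the entire prompt into components and systematically studying the effect of each. Earlier studies concentrated on a limited number of specific prompt attributes, so this approach gives a more complete picture.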

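Key design: The experiments show that repeating text within the prompt boosts model performance, that bigger models are more sensitive to semantic changes in the prompt, and that adding task and inline instructions improves performance even when the instructions are semantically corrupted.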
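As an illustration of what a semantic corruption can look like, the sketch below replaces the words of an instruction with random words and reassigns demonstration labels at random. Both functions, the `VOCAB` list, and the exact corruption rules are assumptions made for this example; the paper's own corruption procedures may differ.

```python
import random
from typing import List, Sequence

# Hypothetical vocabulary, used only for this illustration.
VOCAB = ("lamp", "river", "seven", "quietly", "orbit", "green", "table", "jump")


def randomize_words(text: str, rng: random.Random, vocab: Sequence[str] = VOCAB) -> str:
    """Replace every whitespace-separated token with a random word.

    The token count is preserved, so the prompt keeps its shape while
    the instruction loses its meaning.
    """
    return " ".join(rng.choice(vocab) for _ in text.split())


def randomize_labels(labels: List[str], label_set: Sequence[str], rng: random.Random) -> List[str]:
    """Draw a new label for each demonstration uniformly from label_set.

    Some demonstrations may keep their correct label by chance.
    """
    return [rng.choice(label_set) for _ in labels]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(randomize_words("Is the review positive or negative?", rng))
    print(randomize_labels(["positive", "negative"], ["positive", "negative"], rng))
```

Because the corrupted instruction has the same number of tokens as the original, a comparison between the clean and corrupted prompt isolates the effect of meaning from the effect of simply having text in that position.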

📊 Experimental highlights

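Repeating text within the prompt boosts model performance. Bigger models (≥30B) are more sensitive to semantic changes in the prompt. Adding task and inline instructions enhances performance even when the instructions are semantically corrupted, which underlines how much prompt design matters.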

🎯 Application scenarios

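The study offers a new perspective on prompt design for large language models and may help improve robustness and performance in practice, particularly for AI assistants and automated systems that need high reliability. It may also inform the development of more efficient model training and evaluation strategies.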

📄 Abstract (original)

The ability of large language models (LLMs) to "learn in context" based on the provided prompt has led to an explosive growth in their use, culminating in the proliferation of AI assistants such as ChatGPT, Claude, and Bard. These AI assistants are known to be robust to minor prompt modifications, mostly due to alignment techniques that use human feedback. In contrast, the underlying pre-trained LLMs they use as a backbone are known to be brittle in this respect. Building high-quality backbone models remains a core challenge, and a common approach to assessing their quality is to conduct few-shot evaluation. Such evaluation is notorious for being highly sensitive to minor prompt modifications, as well as the choice of specific in-context examples. Prior work has examined how modifying different elements of the prompt can affect model performance. However, these earlier studies tended to concentrate on a limited number of specific prompt attributes and often produced contradictory results. Additionally, previous research either focused on models with fewer than 15 billion parameters or exclusively examined black-box models like GPT-3 or PaLM, making replication challenging. In the present study, we decompose the entire prompt into four components: task description, demonstration inputs, labels, and inline instructions provided for each demonstration. We investigate the effects of structural and semantic corruptions of these elements on model performance. We study models ranging from 1.5B to 70B in size, using ten datasets covering classification and generation tasks. We find that repeating text within the prompt boosts model performance, and bigger models (≥30B) are more sensitive to the semantics of the prompt. Finally, we observe that adding task and inline instructions to the demonstrations enhances model performance even when the instructions are semantically corrupted.