Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions
Authors: Wesley Scivetti, Melissa Torgbi, Austin Blodgett, Mollie Shichman, Taylor Hudson, Claire Bonial, Harish Tayyar Madabushi
Categories: cs.CL, cs.AI
Published: 2025-01-08 (updated: 2025-08-13)
💡 One-line takeaway
A diagnostic evaluation framework, grounded in Construction Grammar, for assessing the semantic generalization ability of large language models
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: language models, semantic generalization, Construction Grammar, natural language processing, evaluation frameworks, dataset construction, psycholinguistics
📋 Key points
- Current large language models often fail to generalize to uncommon linguistic instances, limiting their language understanding.
- This paper builds a new diagnostic evaluation based on Construction Grammar to systematically test models' semantic understanding and generalization.
- Experiments show that current models suffer a performance drop of over 40% on constructions that are syntactically identical but semantically divergent, revealing insufficient generalization.
📝 Abstract (translated)
As web-scale pretraining data grows, evaluating the linguistic competence of language models becomes challenging, especially on the dynamic, real-world instances that are less common in pretraining data. To this end, this paper constructs a diagnostic evaluation based on Construction Grammar to systematically assess natural language understanding in large language models. The proposed inference evaluation dataset consists of English phrasal constructions and tests whether models can understand their semantics and deploy the appropriate constructional meaning for constructions that are syntactically identical but semantically divergent. Experiments show that current state-of-the-art models exhibit a performance drop of over 40% on these tasks, revealing a deficiency in semantic generalization. The dataset and experimental data are publicly released for researchers.
🔬 Method details
Problem definition: This paper addresses the poor generalization of large language models on uncommon linguistic instances. Existing evaluations fail to distinguish a model's understanding of instances that are well-represented in pretraining data from its understanding of rare ones.
Core idea: Build an evaluation framework using Construction Grammar (CxG), which explicitly links syntactic forms to abstract, non-lexical meanings, in order to test models' semantic understanding and generalization.
Technical framework: The overall pipeline comprises three modules: dataset construction, model evaluation, and result analysis. The dataset consists of English phrasal constructions and evaluates models' semantic understanding across different constructions.
Key innovation: The study leverages the psycholinguistic grounding of Construction Grammar to systematically evaluate models' semantic understanding, especially in cases where syntax is identical but meaning diverges.
Key design: The dataset covers a variety of phrasal constructions, with evaluation tasks designed to be challenging for models in both comprehension and production.
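The evaluation design above can be pictured as a simple scoring loop over inference items tied to phrasal constructions. The sketch below is purely illustrative: the `ConstructionItem` schema, the example sentences, and the yes/no scoring rule are hypothetical stand-ins, not the authors' actual dataset format or prompts (the comparative-correlative example is a stock linguistics illustration).

```python
# Hypothetical sketch of a constructional inference evaluation.
# All field names, sentences, and the scoring rule are illustrative
# assumptions, not the paper's actual dataset or prompt format.
from dataclasses import dataclass

@dataclass
class ConstructionItem:
    sentence: str       # an instantiation of a phrasal construction
    construction: str   # gold construction label
    inference: str      # a candidate inference about the sentence
    holds: bool         # whether the inference follows from the constructional meaning

def score_item(model_answer: str, item: ConstructionItem) -> bool:
    """Mark an item correct if the model's yes/no judgment matches the gold label."""
    predicted = model_answer.strip().lower().startswith("yes")
    return predicted == item.holds

def accuracy(answers: list[str], items: list[ConstructionItem]) -> float:
    correct = sum(score_item(a, it) for a, it in zip(answers, items))
    return correct / len(items)

# The same surface syntax can license one inference and block another;
# a model must use the constructional meaning, not just the word forms.
items = [
    ConstructionItem("The more he reads, the more he learns.",
                     "comparative-correlative",
                     "His learning increases with his reading.", True),
    ConstructionItem("The more he reads, the more he learns.",
                     "comparative-correlative",
                     "He reads a fixed, unchanging amount.", False),
]
answers = ["Yes, that follows.", "No, that does not follow."]
print(accuracy(answers, items))  # → 1.0
```

In this framing, the paper's second task corresponds to pairs of items whose sentences are syntactically identical but whose gold constructional meanings differ, so a model that pattern-matches on syntax alone cannot score well.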
📊 Experimental highlights
State-of-the-art models show a performance drop of over 40% when handling syntactically identical but semantically divergent constructions. This finding reveals a significant deficiency in current models' semantic generalization and points to an important direction for future work.
🎯 Application scenarios
The evaluation framework can be broadly applied in natural language processing, particularly for assessing and improving the semantic understanding and generalization of large language models. With the publicly released dataset, researchers can further probe model behavior on different linguistic instances, advancing both the practical application and the theoretical study of language models.
📄 Abstract (original)
The web-scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represented in pretraining data from generalization to out-of-domain language, specifically the dynamic, real-world instances less common in pretraining data. To this end, we construct a diagnostic evaluation to systematically assess natural language understanding in LLMs by leveraging Construction Grammar (CxG). CxG provides a psycholinguistically grounded framework for testing generalization, as it explicitly links syntactic forms to abstract, non-lexical meanings. Our novel inference evaluation dataset consists of English phrasal constructions, for which speakers are known to be able to abstract over commonplace instantiations in order to understand and produce creative instantiations. Our evaluation dataset uses CxG to evaluate two central questions: first, if models can 'understand' the semantics of sentences for instances that are likely to appear in pretraining data less often, but are intuitive and easy for people to understand. Second, if LLMs can deploy the appropriate constructional semantics given constructions that are syntactically identical but with divergent meanings. Our results demonstrate that state-of-the-art models, including GPT-o1, exhibit a performance drop of over 40% on our second task, revealing a failure to generalize over syntactically identical forms to arrive at distinct constructional meanings in the way humans do. We make our novel dataset and associated experimental data, including prompts and model responses, publicly available.