IntentGrasp: A Comprehensive Benchmark for Intent Understanding

Authors: Yuwei Yin, Chuyuan Li, Giuseppe Carenini

Categories: cs.CL, cs.AI, cs.LG

Published: 2026-05-07

Comments: IntentGrasp data is available on Hugging Face, and the code is released on GitHub

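💡 One-sentence summary

Proposes the IntentGrasp benchmark and the Intentional Fine-Tuning (IFT) method, which substantially improves the intent understanding of large language models.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: intent understanding, large language models, benchmark, instruction fine-tuning, cross-domain generalization, natural language processing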

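📋 Key points

Core problem: Current large language models perform poorly on complex intent understanding. On the challenging Gem Set, 17 of the 20 evaluated models score below the random-guess baseline (15.2%), which points to a limited ability to interpret human intent in depth.
Method: The authors build IntentGrasp, a large-scale, multi-domain benchmark, and propose Intentional Fine-Tuning (IFT), which fine-tunes models on the IntentGrasp training set after all data has been cast into a unified task format.
Results: IFT yields gains of 30+ F1 points on All Set and 20+ points on Gem Set, and leave-one-domain-out (Lodo) experiments show strong cross-domain generalization.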

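📝 Abstract (summary)

Accurately understanding the intent behind speech, conversation, and writing is essential for building helpful large language model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding of LLMs. It draws on 49 high-quality, open-licensed corpora across 12 domains and is constructed through source dataset curation, intent label contextualization, and task format unification. IntentGrasp contains 262,759 training instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Evaluation of 20 LLMs from 7 model families shows unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem Set. Notably, 17 of the 20 models score below the random-guess baseline (15.2%) on Gem Set, whereas estimated human performance is about 81.1%. To address this, the paper proposes Intentional Fine-Tuning (IFT): fine-tuning models on the IntentGrasp training set yields gains of 30+ F1 points on All Set and 20+ points on Gem Set. Leave-one-domain-out (Lodo) experiments further confirm IFT's strong cross-domain generalization.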

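🔬 Method details

Problem definition: The paper targets the limited accuracy of LLMs in understanding complex human intent across many domains. Existing models often fail to capture the deeper intent implied by context, which leads to weak robustness and generalization on unstructured input.

Core idea: Build IntentGrasp, a large-scale, high-quality intent understanding benchmark that standardizes the intent recognition task, then apply Intentional Fine-Tuning (IFT) so that the model learns intent semantics across diverse contexts, making up for the weakness of general-purpose pretrained models on intent classification.

Technical framework: Construction proceeds in three stages. First, source dataset curation collects intent data from 49 open-licensed corpora. Second, intent label contextualization gives intent labels a consistent semantic interpretation across domains. Third, task format unification converts all data into a single instruction-style format used for both training and evaluation.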
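The task format unification step can be pictured with a minimal sketch. This summary does not give the actual IntentGrasp schema or prompt template, so everything below is an assumption made for illustration: the `IntentExample` fields, the lettered multiple-choice layout, and the `to_unified_prompt` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntentExample:
    """One source-corpus record. All field names are hypothetical."""
    domain: str
    context: str
    candidate_intents: list[str]
    gold_intent: str


def to_unified_prompt(example: IntentExample) -> dict[str, str]:
    """Render a record as a lettered multiple-choice prompt/answer pair.

    The multiple-choice layout is an assumption, not the paper's template.
    """
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    if len(example.candidate_intents) > len(letters):
        raise ValueError("too many candidate intents for lettered options")
    if example.gold_intent not in example.candidate_intents:
        raise ValueError("gold intent must be one of the candidates")
    # One line per option, e.g. "B. report lost card".
    options = "\n".join(
        f"{letters[i]}. {intent}"
        for i, intent in enumerate(example.candidate_intents)
    )
    prompt = (
        f"Context ({example.domain}):\n{example.context}\n\n"
        f"Which option best describes the intent?\n{options}\nAnswer:"
    )
    answer = letters[example.candidate_intents.index(example.gold_intent)]
    return {"prompt": prompt, "answer": answer}
```

Once records from different corpora share one prompt/answer shape, the same training and scoring code can be applied to all of them, which is the practical point of unifying the format.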

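Key innovation: The main contribution is a comprehensive benchmark that pairs breadth (All Set) with depth (Gem Set), together with a fine-tuning paradigm aimed at intent understanding. IFT improves performance on seen domains, and the leave-one-domain-out (Lodo) experiments show that it also generalizes well to unseen domains.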
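A leave-one-domain-out protocol can be sketched as follows. This is a generic illustration of the technique, not the authors' released code: the record layout, the `domain` key, and the `leave_one_domain_out` helper are assumptions.

```python
from collections import defaultdict
from typing import Iterator


def leave_one_domain_out(
    records: list[dict],
    domain_key: str = "domain",  # hypothetical field name
) -> Iterator[tuple[str, list[dict], list[dict]]]:
    """Yield (held_out_domain, train_records, test_records) once per domain."""
    by_domain: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        by_domain[record[domain_key]].append(record)
    for held_out in sorted(by_domain):
        # Train on every domain except the held-out one.
        train = [
            r
            for domain, rows in by_domain.items()
            if domain != held_out
            for r in rows
        ]
        yield held_out, train, by_domain[held_out]
```

Each split trains on all domains but one and tests on the domain that was left out, so the test score measures transfer to a domain the model never saw during fine-tuning.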

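Key design: IFT is large-scale supervised fine-tuning on the 262,759 training instances, optimizing the model's loss on multi-domain intent classification. The design emphasizes dataset balance and difficulty so that models are exercised on complex, long-tailed intent distributions.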

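📊 Experimental highlights

The experiments cover 20 frontier LLMs from 7 model families. On Gem Set, 17 of the 20 models score below the 15.2% random-guess baseline, which exposes a major bottleneck in complex intent understanding. With IFT, F1 improves by more than 30 points on All Set and more than 20 points on Gem Set, showing that the method substantially narrows this capability gap.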
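To make "F1 points" concrete, here is a minimal sketch of one common variant, macro-averaged F1 over single-label predictions. This summary does not say which F1 averaging the paper uses, so the macro choice and the `macro_f1` helper are assumptions.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Macro-averaged F1 over every label that appears in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    total = 0.0
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # Per-label F1 = 2*TP / (2*TP + FP + FN).
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)
```

Under this reading, a gain of 30 F1 points means the score, expressed as a percentage, rises by 30, for example from 0.40 to 0.70 on the 0 to 1 scale returned here.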

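🎯 Application scenarios

The work applies to intelligent customer service, personal AI assistants, task-oriented dialogue systems, and office automation tools. More accurate capture of users' underlying intent can make human-computer interaction more natural and task completion more efficient, moving AI assistants toward being more intent-aware, safer, and more practically useful.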

📄 Abstract (original)

Accurately understanding the intent behind speech, conversation, and writing is crucial to the development of helpful Large Language Model (LLM) assistants. This paper introduces IntentGrasp, a comprehensive benchmark for evaluating the intent understanding capability of LLMs. Derived from 49 high-quality, open-licensed corpora spanning 12 diverse domains, IntentGrasp is constructed through source datasets curation, intent label contextualization, and task format unification. IntentGrasp contains a large-scale training set of 262,759 instances and two evaluation sets: an All Set of 12,909 test cases and a more balanced and challenging Gem Set of 470 cases. Extensive evaluations on 20 LLMs across 7 families (including frontier models such as GPT-5.4, Gemini-3.1-Pro, and Claude-Opus-4.7) demonstrate unsatisfactory performance, with scores below 60% on All Set and below 25% on Gem set. Notably, 17 out of 20 tested models perform worse than a random-guess baseline (15.2%) on Gem Set, while the estimated human performance is ~81.1%, showing substantial room for improvement. To enhance such ability, this paper proposes Intentional Fine-Tuning (IFT), which fine-tunes the models on the training set in IntentGrasp, yielding significant gains of 30+ F1 points on All Set and 20+ points on Gem Set. Tellingly, the leave-one-domain-out (Lodo) experiments further demonstrate the strong cross-domain generalizability of IFT, verifying that it is a promising approach to substantially enhancing the intent understanding of LLMs. Overall, by benchmarking and boosting intent understanding ability, this study sheds light on a promising path towards more intentional, capable, and safe AI assistants for human benefits and social good.
