A Global Dataset Mapping the AI Innovation from Academic Research to Industrial Patents

Authors: Haixing Gong, Hui Zou, Xingzhou Liang, Shiyuan Meng, Pinlong Cai, Xingcheng Xu, Jingjing Qu

Categories: cs.DB, cs.AI, cs.DL

Published: 2025-03-12 (updated: 2025-05-30)

Comments: 38 pages and 4 figures

💡 One-Sentence Summary

Builds the DeepInnovationAI dataset for analyzing how AI innovation transfers from academic research to industrial patents.

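🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: artificial intelligence, technology transfer, dataset, patent analysis, academic papers, natural language processing, BERT classifier, hypergraph analysis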

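📋 Key Points

Existing data infrastructures are fragmented, incomplete in coverage, and weak in evaluative capacity, which makes it hard to analyze innovation patterns and technology transfer in AI.
DeepInnovationAI uses large language models, multilingual text analysis, and hypergraph analysis to build a comprehensive dataset of patents, papers, and paper-patent pairs.
The dataset has broad temporal and geographical coverage, supports detailed analysis of technological development patterns and international competition dynamics, and lays a foundation for modeling AI innovation and technology transfer.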

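📝 Abstract (translated from Chinese)

To understand innovation patterns in the rapidly evolving field of artificial intelligence (AI) and effective technology transfer from research to applications, this paper presents DeepInnovationAI, a comprehensive global dataset made up of three structured files. DeepPatentAI.csv contains 2,356,204 patent records with 8 field-specific attributes. DeepDiveAI.csv contains 3,511,929 academic publications with 13 metadata fields. Both files rely on large language models, multilingual text analysis, and dual-layer BERT classifiers to accurately identify AI-related content, and on hypergraph analysis to create robust innovation metrics. In addition, DeepCosineAI.csv applies semantic vector proximity analysis to provide the 3,511,929 most relevant paper-patent pairs, each described by 3 metadata fields, to help identify potential knowledge flows. DeepInnovationAI enables researchers, policymakers, and industry leaders to anticipate trends and identify collaboration opportunities. With its extensive temporal and geographical scope, it supports detailed analysis of technological development patterns and international competition dynamics, and it lays a foundation for modeling AI innovation and technology transfer processes.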

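🔬 Method Details

Problem definition: The paper addresses the difficulty of tracking and analyzing how academic research in AI turns into industrial patents. Existing data are scattered and incomplete and lack effective evaluation metrics, which hinders in-depth understanding and modeling of the technology transfer process.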

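Core idea: Build a comprehensive dataset that links academic papers to industrial patents, then use natural language processing and graph analysis to extract key information and construct innovation metrics that reveal patterns of technology transfer.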

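Technical framework: DeepInnovationAI has three main parts: DeepPatentAI (patent data), DeepDiveAI (paper data), and DeepCosineAI (paper-patent pair data). DeepPatentAI and DeepDiveAI each use large language models and BERT classifiers to identify AI-related content and extract metadata. DeepCosineAI links papers to patents through semantic vector proximity analysis. Hypergraph analysis is used across the framework to build innovation metrics.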
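
The paper-patent linking step can be illustrated with a minimal sketch. It assumes that each paper and patent already has an embedding vector and that every paper is paired with its single closest patent. That would be consistent with the pair count in DeepCosineAI.csv equaling the number of publications, but the summary does not confirm the exact procedure. The identifiers, the toy vectors, and the function names are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def most_similar_patent(paper_vecs, patent_vecs):
    """Map each paper id to (closest patent id, cosine similarity)."""
    pairs = {}
    for paper_id, paper_vec in paper_vecs.items():
        best_id, best_sim = None, -2.0  # below the minimum cosine value of -1
        for patent_id, patent_vec in patent_vecs.items():
            sim = cosine_similarity(paper_vec, patent_vec)
            if sim > best_sim:
                best_id, best_sim = patent_id, sim
        pairs[paper_id] = (best_id, best_sim)
    return pairs


# Toy embeddings (hypothetical, three dimensions for readability).
papers = {"paper_A": [1.0, 0.0, 0.0], "paper_B": [0.0, 1.0, 1.0]}
patents = {"patent_X": [0.9, 0.1, 0.0], "patent_Y": [0.0, 2.0, 2.0]}

pairs = most_similar_patent(papers, patents)
```

Cosine similarity ignores vector length, so `paper_B` matches `patent_Y` even though the two vectors differ in magnitude. At the scale of millions of records, an exhaustive double loop like this one would be impractical, and some form of batched or approximate nearest-neighbor search would be needed.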

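Key innovation: The dataset's main contribution is its breadth and its linkage. It not only contains large volumes of patent and paper data but also connects them through semantic analysis, so knowledge flow and transfer can be analyzed. Its hypergraph-based innovation metrics also offer a new way to evaluate technology transfer outcomes.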
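
To make the hypergraph idea concrete, here is a small sketch of the data structure. A hypergraph differs from an ordinary graph in that one edge can join any number of nodes, for example a patent together with every paper related to it. The summary does not say which metrics the authors derive, so the two quantities below (node degree and hyperedge size) are generic illustrations only, and every identifier is hypothetical.

```python
from collections import defaultdict

# Each hyperedge maps an edge id to the set of nodes it joins.
# Nodes here are papers and patents; all ids are made up for illustration.
hyperedges = {
    "e1": {"paper_A", "paper_B", "patent_X"},
    "e2": {"paper_B", "patent_Y"},
    "e3": {"paper_B", "paper_C", "patent_X", "patent_Y"},
}


def node_degrees(edges):
    """Count how many hyperedges each node belongs to."""
    degrees = defaultdict(int)
    for nodes in edges.values():
        for node in nodes:
            degrees[node] += 1
    return dict(degrees)


def edge_sizes(edges):
    """Count how many nodes each hyperedge joins."""
    return {edge_id: len(nodes) for edge_id, nodes in edges.items()}


degrees = node_degrees(hyperedges)
sizes = edge_sizes(hyperedges)
```

In this toy example `paper_B` appears in all three hyperedges, so it has the highest degree. A count like this is one simple way a hypergraph can flag items that take part in many multi-way relationships.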

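Key design: A dual-layer BERT classifier identifies AI-related content, with the first layer performing coarse-grained classification and the second performing fine-grained classification. Semantic vector proximity analysis uses cosine similarity to measure the semantic distance between papers and patents. Hypergraph analysis builds the innovation network, in which nodes represent papers and patents and edges represent citation relationships or semantic links between them.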
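
The coarse-then-fine arrangement can be sketched as a two-stage cascade, in which the second stage only sees items that pass the first. The sketch below shows the control flow only. The two scoring functions are keyword heuristics standing in for the BERT models, and the threshold, the term list, and the label names are all hypothetical rather than taken from the paper.

```python
def two_stage_classify(text, coarse_score, fine_label, threshold=0.5):
    """Run the fine-grained stage only if the coarse stage accepts the text."""
    if coarse_score(text) < threshold:
        return None  # rejected by stage 1 as not AI-related
    return fine_label(text)


# Stand-ins for the two models (assumptions, not the authors' classifiers).
AI_TERMS = {"neural", "learning", "transformer", "classifier"}


def toy_coarse_score(text):
    """Fraction of words that are in a small AI vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in AI_TERMS)
    return hits / len(words)


def toy_fine_label(text):
    """Assign a hypothetical subfield label to text already accepted as AI."""
    if "transformer" in text.lower():
        return "nlp"
    return "machine_learning"


accepted = two_stage_classify(
    "transformer neural classifier", toy_coarse_score, toy_fine_label
)
rejected = two_stage_classify(
    "steel bridge design", toy_coarse_score, toy_fine_label
)
```

A cascade like this keeps the second, more detailed model from spending effort on records that are clearly out of scope, which matters when the input runs to millions of documents.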

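📊 Experimental Highlights

DeepInnovationAI contains more than two million patent records and more than three million academic publications, with broad temporal and geographical coverage. The dual-layer BERT classifier is reported to markedly improve the accuracy of identifying AI-related content. Semantic vector proximity analysis effectively identifies related paper-patent pairs, which makes it a useful tool for analyzing knowledge flows.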

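🎯 Application Scenarios

The work can be applied to science and technology intelligence analysis, technology transfer assessment, and innovation policy design in the AI field. Researchers can use the dataset to analyze AI development trends and the competitive landscape, policymakers can evaluate the effects of innovation policies, and companies can look for potential collaboration opportunities.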

📄 Abstract (original)

In the rapidly evolving field of artificial intelligence (AI), mapping innovation patterns and understanding effective technology transfer from research to applications are essential for economic growth. However, existing data infrastructures suffer from fragmentation, incomplete coverage, and insufficient evaluative capacity. Here, we present DeepInnovationAI, a comprehensive global dataset containing three structured files. DeepPatentAI.csv: Contains 2,356,204 patent records with 8 field-specific attributes. DeepDiveAI.csv: Encompasses 3,511,929 academic publications with 13 metadata fields. These two datasets leverage large language models, multilingual text analysis and dual-layer BERT classifiers to accurately identify AI-related content, while utilizing hypergraph analysis to create robust innovation metrics. Additionally, DeepCosineAI.csv: By applying semantic vector proximity analysis, this file contains 3,511,929 most relevant paper-patent pairs, each described by 3 metadata fields, to facilitate the identification of potential knowledge flows. DeepInnovationAI enables researchers, policymakers, and industry leaders to anticipate trends and identify collaboration opportunities. With extensive temporal and geographical scope, it supports detailed analysis of technological development patterns and international competition dynamics, establishing a foundation for modeling AI innovation and technology transfer processes.
