Information Extraction from Heterogeneous Documents without Ground Truth Labels using Synthetic Label Generation and Knowledge Distillation

Authors: Aniket Bhattacharyya, Anurag Tripathi

Categories: cs.CL

Published: 2024-11-22 (updated: 2024-11-25)

Comments: Accepted to WACV 2025

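💡 One-Sentence Summary

Proposes TAIL, a method that combines synthetic label generation with knowledge distillation to extract information from heterogeneous documents.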

🎯 Matched Areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

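Keywords: information extraction, heterogeneous documents, synthetic label generation, knowledge distillation, visually rich document understanding, multimodal learning, task instructions

📋 Key Points

Existing methods struggle to extract information from visually rich documents that vary in format and quality and lack annotations.
TAIL uses task-aware instructions to generate synthetic labels, then applies knowledge distillation to improve model performance.
Experiments show the method is cheaper and faster than Claude 3 Sonnet and outperforms layout-aware baselines on the ANLS metric.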

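📝 Abstract (Summary)

This paper proposes Task Aware Instruction-based Labelling (TAIL), a method for generating synthetic labels in unlabelled corpora of visually rich documents (VRDs). A multimodal Visually Rich Document Understanding (VRDU) model is then fine-tuned on the TAIL labels using response-based knowledge distillation, without using the teacher model's weights or training dataset, so that it conditionally generates annotations in the appropriate format. On an external benchmark dataset with ground truth labels, empirical studies identify the conditions under which the approach performs on par with Claude 3 Sonnet. On the internal expense documents of a large multinational organization, the resulting model performs on par with or better than the state-of-the-art large multimodal model (LMM) Claude 3 Sonnet while being 85% less costly and about 5 times faster. It also outperforms layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) score, owing to its ability to reason about and extract information from rare formats. Finally, the paper illustrates the use of the approach in overpayment prevention.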

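🔬 Method Details

Problem definition: The paper addresses information extraction from heterogeneous visually rich documents such as invoices and receipts. These documents vary in format, language, and image quality, and usually come without annotations, which makes model training difficult. Existing methods cope poorly with this variety, and manual annotation is expensive.

Core idea: Task-aware instructions are used to generate synthetic labels, which removes the need for manual annotation. Knowledge distillation then transfers the large model's knowledge to a smaller model, preserving performance while lowering compute cost. The approach rests on two things: the generative ability of large language models (LLMs) and the efficiency of knowledge distillation.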

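Technical framework: The pipeline has two main stages. (1) Synthetic label generation: TAIL produces labels tailored to the task. Concretely, an LLM receives the document image together with a task instruction and generates the annotation. (2) Model training: the VRDU model is fine-tuned on the synthetic labels using response-based knowledge distillation, in which the outputs of the large model (the teacher) serve as training targets for the small model (the student).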
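The two stages above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_instruction`, `generate_synthetic_labels`, and `stub_teacher` are hypothetical names, the JSON output format is assumed, and plain strings stand in for document images. A real setup would replace `stub_teacher` with a call to a large multimodal model.

```python
import json
from typing import Callable, Dict, List


def build_instruction(fields: List[str]) -> str:
    """Compose a task-aware instruction that names the fields to extract."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the document and answer with a "
        f"JSON object using exactly these keys: {field_list}. "
        "Use null for any field that is not present."
    )


def generate_synthetic_labels(
    documents: List[str],
    fields: List[str],
    teacher: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Stage 1: ask the teacher for labels and keep only parseable responses.

    Each returned record pairs a document and instruction with the teacher's
    response, which is the fine-tuning target for the student in stage 2.
    """
    instruction = build_instruction(fields)
    records = []
    for doc in documents:
        raw = teacher(doc, instruction)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop responses that are not valid JSON
        if not isinstance(parsed, dict):
            continue  # drop valid JSON that is not an object
        # Keep only the requested keys; missing keys become None.
        label = {key: parsed.get(key) for key in fields}
        records.append(
            {
                "document": doc,
                "instruction": instruction,
                "label": json.dumps(label, sort_keys=True),
            }
        )
    return records


def stub_teacher(document: str, instruction: str) -> str:
    """Hypothetical stand-in for a large multimodal model call."""
    if "TOTAL" not in document:
        return "I cannot read this document."
    total = document.split("TOTAL")[1].strip()
    return json.dumps({"merchant": None, "total": total})


docs = ["ACME TOTAL 12.50", "blurry scan"]
records = generate_synthetic_labels(docs, ["merchant", "total"], stub_teacher)
```

The resulting `records` list is what a student model would be fine-tuned on. Filtering out unparseable teacher responses is one simple design choice for keeping malformed targets out of the training set; the paper may handle this differently.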

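Key innovation: The main contribution is TAIL, which automatically generates synthetic labels according to the requirements of the task. Unlike conventional data augmentation, TAIL produces labels that carry task-specific semantic information and therefore guide training more effectively. In addition, response-based knowledge distillation relies only on the teacher's outputs, so the teacher's weights and training dataset are not needed.

Key design: Instruction design is central to TAIL; the instruction must be adapted to each task so that it elicits suitable labels. The design of the distillation loss also matters, since it has to balance the gap between student and teacher against the student's own performance. The specific VRDU model architecture and parameter settings used in the paper are not stated here.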
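Since the loss is not given here, the following is only a sketch of one common form of response-based distillation: the student is trained to maximize the likelihood of the teacher's generated tokens, that is, token-level cross-entropy against the teacher's response. The function name and the dictionary representation of the student's per-step distributions are assumptions made for illustration; the paper's actual loss may differ.

```python
import math
from typing import Dict, List


def response_distillation_loss(
    student_step_probs: List[Dict[str, float]],
    teacher_tokens: List[str],
    eps: float = 1e-12,
) -> float:
    """Mean negative log-likelihood of the teacher's tokens under the student.

    student_step_probs[i] maps each candidate token to the probability the
    student assigns to it at step i; teacher_tokens[i] is the token the
    teacher actually produced at that step.
    """
    if not teacher_tokens:
        raise ValueError("teacher response must contain at least one token")
    if len(student_step_probs) != len(teacher_tokens):
        raise ValueError("one student distribution is required per teacher token")
    total = 0.0
    for probs, token in zip(student_step_probs, teacher_tokens):
        # Clamp to eps so a zero probability does not produce log(0).
        total += -math.log(max(probs.get(token, 0.0), eps))
    return total / len(teacher_tokens)


# Toy example: a two-token teacher response and the student's distributions.
student = [{"12": 0.5, "13": 0.5}, {".50": 0.75, ".00": 0.25}]
loss = response_distillation_loss(student, ["12", ".50"])
```

Only the teacher's response text enters this loss, which is why no teacher weights or logits are required.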

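📊 Experimental Highlights

On internal expense documents, the method performs on par with or better than Claude 3 Sonnet at 85% lower cost and roughly 5 times the speed. It also outperforms layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) score. Together, these figures indicate a favourable trade-off among accuracy, cost, and speed.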
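For readers unfamiliar with ANLS, the sketch below computes it in its commonly used form: each prediction is scored as 1 minus the length-normalized Levenshtein distance to the closest reference, scores whose normalized distance reaches a threshold are set to 0, and the results are averaged. The threshold of 0.5 and the lowercasing and whitespace stripping are widely used defaults and are assumptions here; the paper's exact evaluation settings are not stated in this summary.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def anls(
    predictions: List[str],
    references: List[List[str]],
    threshold: float = 0.5,
) -> float:
    """Average Normalized Levenshtein Similarity over a set of predictions.

    references[i] lists every acceptable answer for predictions[i]; the best
    match is used.
    """
    if not predictions or len(predictions) != len(references):
        raise ValueError("need one non-empty list of references per prediction")
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            distance = levenshtein(p, r) / longest if longest else 0.0
            score = 1.0 - distance if distance < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# One near miss (letter O instead of zero) and one complete miss.
score = anls(["12.5O", "abc"], [["12.50"], ["xyz"]])
```

In this toy case the near miss scores 0.8 and the complete miss scores 0, so `score` is about 0.4. The threshold keeps badly wrong answers from earning partial credit, while small OCR-style slips are only lightly penalized.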

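🎯 Application Scenarios

The approach applies to expense reimbursement automation, document auditing, fraud detection, and related areas. Automatically extracting key information from invoices, receipts, and similar documents can raise efficiency, reduce manual effort, and help prevent financial risk. The technique could later be extended to other document processing tasks such as contract management and legal document analysis.

📄 Abstract (Original)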

Invoices and receipts submitted by employees are visually rich documents (VRDs) with textual, visual and layout information. To protect against the risk of fraud and abuse, it is crucial for organizations to efficiently extract desired information from submitted receipts. This helps in the assessment of key factors such as appropriateness of the expense claim, adherence to spending and transaction policies, the validity of the receipt, as well as downstream anomaly detection at various levels. These documents are heterogeneous, with multiple formats and languages, uploaded with different image qualities, and often do not contain ground truth labels for the efficient training of models. In this paper we propose Task Aware Instruction-based Labelling (TAIL), a method for synthetic label generation in VRD corpuses without labels, and fine-tune a multimodal Visually Rich Document Understanding Model (VRDU) on TAIL labels using response-based knowledge distillation without using the teacher model's weights or training dataset to conditionally generate annotations in the appropriate format. Using a benchmark external dataset where ground truth labels are available, we demonstrate conditions under which our approach performs at par with Claude 3 Sonnet through empirical studies. We then show that the resulting model performs at par or better on the internal expense documents of a large multinational organization than state-of-the-art LMM (large multimodal model) Claude 3 Sonnet while being 85% less costly and ~5X faster, and outperforms layout-aware baselines by more than 10% in Average Normalized Levenshtein Similarity (ANLS) scores due to its ability to reason and extract information from rare formats. Finally, we illustrate the usage of our approach in overpayment prevention.
