Fusing Bidirectional Chains of Thought and Reward Mechanisms A Method for Enhancing Question-Answering Capabilities of Large Language Models for Chinese Intangible Cultural Heritage

📄 arXiv: 2505.08167v4

Authors: Ruilin Liu, Zhixiao Zhao, Jieqiong Li, Chang Liu, Dongbo Wang

Categories: cs.CL, cs.AI

Published: 2025-05-13 (updated: 2025-06-10)

Comments: We want to withdraw this paper due to data usage permission issues identified after submission. We discovered that our use of certain intangible cultural heritage materials required additional community permissions and institutional ethical approvals that were not obtained.


💡 One-Sentence Takeaway

Proposes a method that fuses bidirectional chains of thought with a reward mechanism to improve large language models' question-answering ability on Chinese intangible cultural heritage.

🎯 Matched pillars: Pillar 2: RL Algorithms & Architecture; Pillar 9: Embodied Foundation Models

Keywords: large language models; intangible cultural heritage; question answering; bidirectional chain of thought; reward mechanism

📋 Key Points

  1. Existing approaches that apply large language models to intangible cultural heritage (ICH) question answering suffer from knowledge bias, incorrect knowledge inheritance, and catastrophic forgetting.
  2. The paper proposes a training method that fuses bidirectional chains of thought with a reward mechanism, using forward and reverse reasoning to activate the model's latent knowledge and optimize its decision-making.
  3. Experiments show that the method significantly improves accuracy, Bleu-4, and Rouge-L on the ICH-Qwen model and generalizes well to other domains.


🔬 Method Details

Problem definition: The paper targets the performance degradation of large language models on Chinese intangible cultural heritage (ICH) question answering caused by data bias, incorrect knowledge inheritance, and catastrophic forgetting. Existing approaches such as direct fine-tuning or simple data augmentation fail to address these issues effectively, leaving the model with insufficient domain knowledge and low answer accuracy.

Core idea: Combine bidirectional chains of thought with a reward mechanism to strengthen knowledge activation and decision optimization during reasoning. The bidirectional chains of thought pair forward reasoning with reverse questioning and reverse reasoning to mine the model's internal knowledge more thoroughly and reduce knowledge bias, while the reward mechanism guides the model toward higher-quality answers through structural and content evaluation.

Technical framework: Built on the ICH-Qwen model, the pipeline comprises four stages: 1) Data preparation: construct a QA dataset covering ICH knowledge. 2) Bidirectional chain-of-thought training: generate answers via forward and reverse reasoning chains to enhance knowledge activation. 3) Reward-mechanism training: reward the model according to the structural and content quality of its answers to optimize decision-making. 4) Model evaluation: measure performance on a held-out test set.
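Stage 2 can be pictured as turning each QA pair into forward and reverse training samples. The sketch below is a minimal illustration, assuming a generic `generate` callable standing in for the underlying LLM (e.g. ICH-Qwen); the prompt templates and `CoTSample` type are hypothetical, not from the paper.

```python
# Hypothetical sketch of building bidirectional chain-of-thought training
# samples from one QA pair. `generate` is a placeholder for an LLM call.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CoTSample:
    prompt: str
    target: str


def build_bidirectional_samples(question: str, answer: str,
                                generate: Callable[[str], str]) -> List[CoTSample]:
    samples = []
    # Forward reasoning: question -> step-by-step chain -> answer.
    fwd_chain = generate(f"Question: {question}\nThink step by step, then answer.")
    samples.append(CoTSample(prompt=f"Question: {question}", target=fwd_chain))
    # Reverse questioning: given the answer, reconstruct the question,
    # activating knowledge the forward pass may miss.
    rev_question = generate(f"Answer: {answer}\nWhat question does this answer?")
    samples.append(CoTSample(prompt=f"Answer: {answer}\nPose the question.",
                             target=rev_question))
    # Reverse reasoning: justify the answer backwards from its conclusion.
    rev_chain = generate(f"Answer: {answer}\nExplain the reasoning that leads here.")
    samples.append(CoTSample(prompt=f"Justify: {answer}", target=rev_chain))
    return samples
```

In this reading, each original QA pair yields three supervision signals (one forward, two reverse), which is one plausible way the reverse direction could "activate latent knowledge" as the paper describes.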

Key innovation: The central contribution is the fusion of bidirectional chains of thought with a reward mechanism: the former activates the model's internal knowledge more comprehensively, while the latter steers generation toward higher-quality answers. This combination addresses the under-utilization of knowledge and the low answer quality of conventional methods; unlike existing approaches, it attends not only to what the model knows but to how effectively that knowledge is used and how good the resulting answers are.

Key design: In the bidirectional chains of thought, forward reasoning follows standard chain-of-thought generation, while the reverse direction produces answers via reverse questioning or reverse reasoning. The reward mechanism comprises a structure reward, which assesses the completeness and logical coherence of an answer, and a content reward, which assesses its accuracy and relevance. The reward function is a weighted sum of the two, with weights tuned per task and dataset.
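The weighted-sum reward can be sketched as follows. This is a toy illustration, not the paper's actual scoring functions: the structure score here just favors complete multi-sentence answers, the content score is a simple token-overlap proxy for accuracy/relevance, and the weights `w_struct`/`w_content` are illustrative defaults.

```python
# Toy sketch of a weighted structure + content reward. The component
# scorers and default weights are illustrative assumptions, not the
# paper's definitions.
def _tokens(text: str) -> set:
    """Lowercase tokens with edge punctuation stripped."""
    return {t.strip(".,!?") for t in text.lower().split() if t.strip(".,!?")}


def structure_score(answer: str) -> float:
    """Structural proxy: multi-sentence answers score higher (capped at 1)."""
    sentences = [s for s in answer.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    return min(len(sentences) / 3.0, 1.0)


def content_score(answer: str, reference: str) -> float:
    """Content proxy: token overlap with a reference answer."""
    a, r = _tokens(answer), _tokens(reference)
    return len(a & r) / len(r) if r else 0.0


def reward(answer: str, reference: str,
           w_struct: float = 0.4, w_content: float = 0.6) -> float:
    """Weighted sum of structure and content evaluations."""
    return w_struct * structure_score(answer) + w_content * content_score(answer, reference)
```

Adjusting `w_struct` and `w_content` per task mirrors the paper's "different weighting schemes": a fact-heavy dataset would weight content higher, while a task demanding well-organized explanations would weight structure higher.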

📊 Experimental Highlights

The method significantly improves QA accuracy, Bleu-4, and Rouge-L on the ICH-Qwen model, with notable accuracy gains over 0-shot, step-by-step reasoning, knowledge distillation, and question-augmentation baselines. Ablations confirm the effectiveness of both the bidirectional chains of thought and the reward mechanism, and generalization experiments show improvements on Finance, Wikidata, and StrategyQA datasets.

🎯 Applications

The results can be applied to building intelligent ICH knowledge bases, smart tour-guide systems, and educational robots. By strengthening large language models' domain-specific QA ability, the work supports the preservation and transmission of intangible cultural heritage and enables more intelligent knowledge services in related fields. The method also generalizes well and can be extended to knowledge-intensive tasks in other domains.

📄 Abstract (Original)

The rapid development of large language models (LLMs) has provided significant support and opportunities for the advancement of domain-specific LLMs. However, fine-tuning these large models using Intangible Cultural Heritage (ICH) data inevitably faces challenges such as bias, incorrect knowledge inheritance, and catastrophic forgetting. To address these issues, we propose a novel training method that integrates a bidirectional chains of thought and a reward mechanism. This method is built upon ICH-Qwen, a large language model specifically designed for the field of intangible cultural heritage. The proposed method enables the model to not only perform forward reasoning but also enhances the accuracy of the generated answers by utilizing reverse questioning and reverse reasoning to activate the model's latent knowledge. Additionally, a reward mechanism is introduced during training to optimize the decision-making process. This mechanism improves the quality of the model's outputs through structural and content evaluations with different weighting schemes. We conduct comparative experiments on ICH-Qwen, with results demonstrating that our method outperforms 0-shot, step-by-step reasoning, knowledge distillation, and question augmentation methods in terms of accuracy, Bleu-4, and Rouge-L scores on the question-answering task. Furthermore, the paper highlights the effectiveness of combining the bidirectional chains of thought and reward mechanism through ablation experiments. In addition, a series of generalizability experiments are conducted, with results showing that the proposed method yields improvements on various domain-specific datasets and advanced models in areas such as Finance, Wikidata, and StrategyQA. This demonstrates that the method is adaptable to multiple domains and provides a valuable approach for model training in future applications across diverse fields.