DataDignity: Training Data Attribution for Large Language Models

Authors: Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier

Category: cs.AI

Published: 2026-05-07

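💡 One-Sentence Summary

Introduces the DataDignity framework and the FakeWiki benchmark, using supervised contrastive learning to trace large language model responses back to their supporting training documents.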

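🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: large language models, training data attribution, contrastive learning, information retrieval, model auditing, explainable AI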

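📋 Key Points

Existing retrieval methods often lean on lexical overlap and struggle to separate documents that actually support a fact from distractors that are only topically similar.
The paper introduces the FakeWiki benchmark and ScoringModel, which uses supervised contrastive learning to map responses and documents into a shared feature space for provenance ranking.
Across nine LLMs and five query conditions, ScoringModel raises mean Recall@10 from 35.0 (strongest retrieval baseline) to 52.2, including on jailbreak-inspired transformed queries.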

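📝 Abstract (Summary)

Auditing the output of a large language model (LLM) requires more than judging correctness: the auditor may also need to identify the source document that supports the knowledge in a response. The paper frames this as "pinpoint provenance": given a prompt, a model response, and a candidate corpus, rank the documents that best support the response. The authors build FakeWiki, a benchmark of 3,537 fabricated Wikipedia-style articles. It combines QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents, and five query conditions (clean prompting plus four jailbreak-inspired transformations) to weaken lexical shortcuts. The paper proposes SteerFuse, a training-free activation-steering retrieval-fusion method, and ScoringModel, a supervised contrastive provenance ranker. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, indicating that it separates true answer support from surface lexical similarity more robustly than the baselines.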

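🔬 Method Details

Problem definition: The paper addresses "pinpoint provenance" for LLM outputs. Given a candidate corpus, the goal is to identify the source documents that actually support a generated response, rather than simply retrieving topically related text.

Core idea: Build a ranker that distinguishes factual support from lexical similarity. FakeWiki serves as a controlled benchmark with known ground-truth provenance, so that methods are rewarded for fact-level matching instead of surface word overlap.

Technical framework: There are two main methods. SteerFuse is training-free: it uses activation-space evidence from the target model and fuses it with text retrieval. ScoringModel is a supervised contrastive ranker: it maps response and document features into a shared feature space and is optimized for ranking with a contrastive objective.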
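The shared-space ranking idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the summary does not say which similarity function ScoringModel uses, so cosine similarity is a stand-in, and the three-dimensional vectors and document ids are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(response_vec, doc_vecs):
    """Return document ids sorted by descending similarity to the response."""
    scores = {doc_id: cosine(response_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embeddings (not from the paper), already in the shared space.
response = [1.0, 0.0, 1.0]
docs = {
    "source_doc": [0.9, 0.1, 1.1],  # supports the response
    "anti_doc": [1.0, 1.0, 0.0],    # topically close, missing the key fact
    "unrelated": [0.0, 1.0, 0.0],
}
ranking = rank_documents(response, docs)
```

In this toy setup the supporting document ranks first and the anti-document second, which is the ordering a provenance ranker is trained to produce.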

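Key innovation: Anti-documents, which stay topically similar to the source while removing answer-critical facts, are used as hard negatives during training. This pushes the model toward fine-grained evidence matching rather than plain semantic similarity.

Key design: ScoringModel is trained with the InfoNCE loss using three kinds of negatives: in-batch, retrieval-mined, and anti-document. The aim is to keep provenance ranking accurate under harder queries such as jailbreak-inspired transformations.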
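A minimal sketch of an InfoNCE loss for a single response, to show how the three negative sources enter the same softmax denominator. The dot-product scoring, the temperature value, and the toy vectors are assumptions for illustration, not details taken from the paper.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE for one query: negative log-softmax score of the positive."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

# Hypothetical 2-d features (illustrative only).
response = [1.0, 0.0]
source_doc = [1.0, 0.0]
in_batch = [[0.0, 1.0]]          # other examples' documents in the batch
retrieval_mined = [[0.6, 0.8]]   # high-ranked but wrong retrieval results
anti_docs = [[0.8, 0.6]]         # topically similar, key facts removed

easy_loss = info_nce(response, source_doc, in_batch)
hard_loss = info_nce(response, source_doc, in_batch + retrieval_mined + anti_docs)
```

Here `hard_loss` is larger than `easy_loss` because the harder negatives score closer to the positive, so they contribute more training signal than random in-batch negatives.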

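📊 Experimental Highlights

Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41 of the 45 model-by-condition cells (9 models × 5 conditions). SteerFuse is usually second-best despite requiring no supervised training. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline.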
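For readers unfamiliar with the metric, here is a minimal sketch of Recall@k averaged over queries. It assumes the common definition (fraction of a query's gold documents found in the top k); the paper's exact evaluation protocol is not given in this summary, and the query and document ids are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    return len(relevant & set(ranked_ids[:k])) / len(relevant)

def mean_recall_at_k(rankings, gold, k=10):
    """Average Recall@k over all queries in `rankings`."""
    values = [recall_at_k(rankings[q], gold[q], k) for q in rankings]
    return sum(values) / len(values)

# Hypothetical rankings and gold source documents (illustrative only).
rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4", "d6"]}
gold = {"q1": ["d1"], "q2": ["d6"]}
score = mean_recall_at_k(rankings, gold, k=2)
```

With k=2, the gold document for `q1` is in the top two but the one for `q2` is not, so the mean is 0.5; reported as a percentage, that would be 50.0.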

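🎯 Applications

The work is relevant to AI transparency, copyright protection, and fact-checking. It could be used to build LLM auditing systems that help developers trace where a model's knowledge comes from, flag biased or infringing content in training data, and support provenance for AI-generated content, which in turn may increase user trust in model outputs.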

📄 Abstract (Original)

Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
