Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Authors: Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah, Corey Feld

Category: cs.CL

Published: 2026-05-07

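💡 One-Sentence Summary

Proposes the first citation evaluation framework for LLM deep research agents and reveals a severe disconnect between surface-level citation quality and factual accuracy.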

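🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, deep research agents, source attribution evaluation, retrieval-augmented generation, fact-checking, hallucination detection, automated evaluation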

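📋 Key Points

Existing research agents lack an effective mechanism for verifying cited sources, so model-generated citations can be hallucinated, and neither source accessibility nor factual consistency is guaranteed.
The paper builds the first source attribution evaluation framework based on an AST parser. It closes the loop by retrieving the cited content and scores citation quality along three dimensions: link validity, relevance, and factual accuracy.
Experiments show that frontier models perform well on link validity but have generally low factual accuracy, and that factual accuracy declines markedly as retrieval depth increases.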

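📝 Abstract (summary)

Deep research agents powered by large language models (LLMs) can synthesize hundreds of web sources into cited reports, but these citations often cannot be reliably verified. Existing approaches either blindly trust the model's self-citation or rely on retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. The paper proposes the first source attribution evaluation framework, which uses a reproducible AST parser to extract and evaluate, at scale, the inline citations in LLM-generated Markdown reports. By retrieving the content that was actually cited, the framework lets human or model evaluators verify each citation along three dimensions: link validity, content relevance, and factual accuracy. The study benchmarks 14 closed-source and open-source models. Although the models perform well on link validity, they reach only 39-77% factual accuracy, and factual accuracy drops markedly as retrieval scales up, revealing a critical disconnect between surface-level citation quality and factual reliability.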

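🔬 Method Details

Problem definition: Current LLM deep research agents lack systematic verification of their sources when they generate cited reports. Existing approaches either trust the model to cite itself accurately (which risks bias) or rely on RAG pipelines that have no fact-checking capability, so "cited but not verified" output is widespread.

Core idea: The paper proposes a closed-loop evaluation framework. It parses the citation structure of a Markdown report, retrieves the original source for every citation, and uses a human-calibrated LLM-as-a-judge evaluator to verify each citation along several dimensions, thereby quantifying how faithful the model is to its sources.

Technical framework: The framework has three core modules: (1) an AST parser that extracts structured citations from Markdown reports; (2) a source retrieval module that fetches the original web content behind each citation; and (3) a three-dimension evaluator that scores link accessibility, content relevance, and factual accuracy separately.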
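
The first module and the per-citation record can be sketched as follows. This is a minimal illustration, not the paper's parser: it swaps in a regular expression for the AST parser, and the names `INLINE_LINK`, `Citation`, and `extract_citations` are hypothetical. The three `Optional[bool]` fields mirror the Link Works, Relevant Content, and Fact Check dimensions and stay unset until an evaluator fills them in.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: a regex stand-in for the paper's AST parser. It matches inline
# Markdown links of the form [text](http...) and skips images (![alt](url)).
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\((https?://[^\s)]+)\)")


@dataclass
class Citation:
    anchor_text: str                    # text inside the square brackets
    url: str                            # cited URL
    claim: str                          # line of the report carrying the citation
    link_works: Optional[bool] = None   # dimension 1: URL accessibility
    relevant: Optional[bool] = None     # dimension 2: topical alignment
    fact_check: Optional[bool] = None   # dimension 3: factual accuracy


def extract_citations(markdown: str) -> List[Citation]:
    """Collect every inline link in the report, keeping its line as context."""
    citations = []
    for line in markdown.splitlines():
        for match in INLINE_LINK.finditer(line):
            citations.append(
                Citation(match.group(1), match.group(2), line.strip())
            )
    return citations
```

A regex like this misses reference-style links and URLs that contain parentheses, which is one reason a real Markdown AST is the more robust choice for extraction at scale.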

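Key innovation: The study is the first to verify the citations in LLM-generated reports automatically, at scale, and along multiple dimensions. The essential difference from prior work is that citation verification moves from checking claims in isolation to checking each claim for consistency against the content of the source it cites.

Key design: The evaluators are rubric-based LLM judges calibrated through human review. Ablation experiments on research depth (the number of tool calls) measure the effect on factual accuracy and show that deeper retrieval does not yield more accurate citations.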
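
This summary does not describe how the calibration against human review was carried out. One common way to check an LLM judge against human labels is to compute raw agreement and Cohen's kappa on a shared sample of citations; the sketch below assumes binary pass/fail verdicts, and the function name `agreement_and_kappa` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def agreement_and_kappa(
    judge: Sequence[bool], human: Sequence[bool]
) -> Tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two binary label sequences."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n   # share of citations the judge passes
    p_human = sum(human) / n   # share of citations the human passes
    # Agreement expected by chance if the two raters were independent.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is undefined (0/0).
        return observed, math.nan
    return observed, (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance, which matters when one verdict dominates, as it would on a dimension such as link validity where most citations pass.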

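📊 Experimental Highlights

The study benchmarks 14 mainstream models and finds that frontier models exceed 94% link validity but reach only 39-77% factual accuracy. The ablation shows that as the number of tool calls grows from 2 to 150, factual accuracy drops by approximately 42% on average across two frontier models, indicating that simply retrieving more does not improve factual reliability and may instead introduce more noise.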
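
A depth ablation of this kind reduces to grouping Fact Check verdicts by tool-call budget and comparing accuracies. The sketch below is an illustration only: the function names are hypothetical, and because the summary does not say whether the roughly 42% figure is a relative or a percentage-point drop, it computes both.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fact_check_accuracy_by_depth(
    records: Iterable[Tuple[int, bool]]
) -> Dict[int, float]:
    """Map each tool-call budget to the share of citations that pass Fact Check.

    Each record is (tool_calls, passed_fact_check).
    """
    passed: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for tool_calls, ok in records:
        total[tool_calls] += 1
        passed[tool_calls] += int(ok)
    return {depth: passed[depth] / total[depth] for depth in sorted(total)}


def absolute_drop(before: float, after: float) -> float:
    """Drop in percentage points, expressed as a fraction (0.10 = 10 points)."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Drop as a share of the starting accuracy."""
    return (before - after) / before
```

With made-up placeholder verdicts (not results from the paper) in which 3 of 4 citations pass at 2 tool calls and 1 of 4 passes at 150, the accuracies are 0.75 and 0.25, an absolute drop of 50 points and a relative drop of about 67%.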

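🎯 Application Scenarios

The framework can be applied to automated research-assistance tools, enterprise knowledge management systems, and news fact-checking platforms. By integrating this evaluation mechanism, developers can monitor the output quality of research agents, reduce factual hallucinations in AI-generated content, and make AI more trustworthy in high-stakes domains such as law, medicine, and academic research.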

📄 Abstract (original)

Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions. (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess the disconnect.
