CyberThreat-Eval: Can Large Language Models Automate Real-World Threat Research?
Authors: Xiangsen Chen, Xuan Feng, Shuo Chen, Matthieu Maitre, Sudipto Rakshit, Diana Duvieilh, Ashley Picone, Nan Tang
Categories: cs.CR, cs.CL
Published: 2026-03-10
Comments: Accepted at TMLR
Journal: Transactions on Machine Learning Research (2025), ISSN 2835-8856
🔗 Code/Project: GITHUB | HUGGINGFACE
💡 One-sentence takeaway
Proposes CyberThreat-Eval to address the inadequacy of existing benchmarks for automating real-world CTI reporting.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: cyber threat intelligence, large language models, open-source intelligence, automated reporting, expert annotation, evaluation benchmark
📋 Key points
- Existing LLM benchmarks fail to reflect real analyst workflows and rely largely on model-centric metrics.
- The paper proposes the CyberThreat-Eval benchmark, which covers all three stages of the CTI workflow and adopts analyst-centric metrics.
- Evaluation shows that current LLMs fall markedly short when handling complex information and struggle to distinguish correct from incorrect information.
📝 Abstract (translated)
Analyzing open-source intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive cyber threat intelligence (CTI) reports. The process typically follows a three-stage workflow: triage, deep search, and TI drafting. While large language models (LLMs) offer a promising route toward automation, existing benchmarks have limitations and often fail to reflect real analyst workflows. To address this, we introduce CyberThreat-Eval, an expert-annotated benchmark collected from the daily CTI workflow of a world-leading company, which evaluates LLMs on practical tasks across all three stages. Our evaluation reveals limitations of current LLMs, such as lacking the nuanced expertise required to handle complex details.
🔬 Method details
Problem definition: The paper addresses the shortcomings of existing LLMs in automating cyber threat intelligence (CTI), in particular that existing benchmarks do not faithfully reflect analyst workflows and lack effective evaluation criteria.
Core idea: The CyberThreat-Eval benchmark provides a new evaluation approach focused on practical tasks and analyst needs, emphasizing factual accuracy, content quality, and operational cost.
Technical framework: The overall pipeline comprises three stages: triage, deep search, and TI drafting. Each stage is evaluated against expert-annotated data, combined with external ground-truth databases and human expert knowledge.
Key innovation: The central contribution is a set of analyst-centric metrics that emphasize actionable, operational value rather than mere lexical overlap. In addition, the CTI workflow is designed so that human experts can provide feedback, enabling continuous improvement.
Key design: Multiple evaluation metrics measure model performance, including factual accuracy and content quality, to ensure the practicality and reliability of model outputs.
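To make the analyst-centric idea concrete, here is a minimal sketch of how factual accuracy could be scored per workflow stage by checking the structured fields an analyst actually verifies, rather than lexical overlap. All names here (`Example`, `FACT_FIELDS`, the field names, and the scoring logic) are illustrative assumptions, not taken from the CyberThreat-Eval codebase.

```python
# Hypothetical sketch of an analyst-centric evaluation loop for a
# three-stage CTI workflow (triage -> deep search -> TI drafting).
# Illustrative only; not the paper's actual implementation.

from dataclasses import dataclass

@dataclass
class Example:
    stage: str          # "triage", "deep_search", or "ti_drafting"
    model_output: dict  # structured fields extracted from the LLM output
    gold: dict          # expert-annotated ground truth

# Fields an analyst actually checks, rather than raw lexical overlap.
FACT_FIELDS = ["threat_actor", "malware_family", "cve_ids", "iocs"]

def factual_accuracy(ex: Example) -> float:
    """Fraction of expert-annotated fact fields the model got right."""
    checked = [f for f in FACT_FIELDS if f in ex.gold]
    if not checked:
        return 1.0
    correct = sum(ex.model_output.get(f) == ex.gold[f] for f in checked)
    return correct / len(checked)

def evaluate(examples):
    """Average factual accuracy per workflow stage."""
    per_stage = {}
    for ex in examples:
        per_stage.setdefault(ex.stage, []).append(factual_accuracy(ex))
    return {s: sum(v) / len(v) for s, v in per_stage.items()}
```

A real harness would also track content quality (e.g. expert rubric scores) and operational cost (tokens, latency) alongside this factual check, per the paper's three metric families.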
📊 Experimental highlights
Experiments show that current LLMs fall markedly short when handling complex information, especially in distinguishing correct from incorrect information. Evaluation on the CyberThreat-Eval benchmark reveals a substantial gap between LLM and human-expert performance, underscoring the need for further improvement.
🎯 Application scenarios
Potential application areas include network security, intelligence analysis, and automated report generation. Improving LLM performance on CTI tasks could significantly raise the efficiency of threat detection and response, strengthening overall security posture.
📄 Abstract (original)
Analyzing Open Source Intelligence (OSINT) from large volumes of data is critical for drafting and publishing comprehensive CTI reports. This process usually follows a three-stage workflow -- triage, deep search and TI drafting. While Large Language Models (LLMs) offer a promising route toward automation, existing benchmarks still have limitations. These benchmarks often consist of tasks that do not reflect real-world analyst workflows. For example, human analysts rarely receive tasks in the form of multiple-choice questions. Also, existing benchmarks often rely on model-centric metrics that emphasize lexical overlap rather than actionable, detailed insights essential for security analysts. Moreover, they typically fail to cover the complete three-stage workflow. To address these issues, we introduce CyberThreat-Eval, which is collected from the daily CTI workflow of a world-leading company. This expert-annotated benchmark assesses LLMs on practical tasks across all three stages as mentioned above. It utilizes analyst-centric metrics that measure factual accuracy, content quality, and operational costs. Our evaluation using this benchmark reveals important insights into the limitations of current LLMs. For example, LLMs often lack the nuanced expertise required to handle complex details and struggle to distinguish between correct and incorrect information. To address these challenges, the CTI workflow incorporates both external ground-truth databases and human expert knowledge. TRA allows human experts to iteratively provide feedback for continuous improvement. The code is available at \href{https://github.com/xschen-beb/CyberThreat-Eval}{\texttt{GitHub}} and \href{https://huggingface.co/datasets/xse/CyberThreat-Eval}{\texttt{HuggingFace}}.