CritBench: A Framework for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments

Authors: Gustav Keppler, Moritz Gstür, Veit Hagenmeyer

Categories: cs.CR, cs.AI

Published: 2026-04-07

Comments: 16 pages, 4 figures, 3 tables. Submitted to the 3rd ACM SIGEnergy Workshop on Cybersecurity and Privacy of Energy Systems (ACM EnergySP '26)

🔗 Code/Project: https://github.com/GKeppler/CritBench

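💡 One-Sentence Summary

Proposes the CritBench framework for evaluating the cybersecurity capabilities of large language models in IEC 61850 digital substation environments.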

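🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: cybersecurity, large language models, IEC 61850, digital substations, operational technology, evaluation framework, dynamic tasks, tool scaffold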

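📋 Key Points

Existing evaluation frameworks focus mainly on Information Technology (IT) environments and do not address the specialized protocols and constraints of Operational Technology (OT), which limits what their results can show.
This paper proposes CritBench, a framework designed specifically to evaluate the cybersecurity capabilities of large language models in IEC 61850 digital substation environments, filling this gap.
Experiments show that LLM agents perform well on static tasks but degrade on dynamic tasks; equipping them with a domain-specific tool scaffold significantly mitigates this degradation.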

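📝 Abstract (translated from the Chinese)

Advances in Large Language Models (LLMs) have raised concerns about their dual-use potential in cybersecurity. Existing evaluation frameworks focus mainly on Information Technology (IT) environments and fail to capture the constraints and specialized protocols of Operational Technology (OT). To fill this gap, the paper introduces CritBench, a novel framework for evaluating the cybersecurity capabilities of LLM agents in IEC 61850 digital substation environments. The authors evaluate five state-of-the-art models, including OpenAI's GPT-5 suite, on 81 domain-specific tasks covering static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. The empirical results show that agents perform reliably on static structured-file analysis and single-tool network enumeration, but that their performance degrades on dynamic tasks. Although current models show explicit, internalized knowledge of IEC 61850 terminology, without specialized tools they still struggle with the persistent sequential reasoning and state tracking needed to handle live systems. The authors' domain-specific tool scaffold significantly mitigates this operational bottleneck.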

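🔬 Method Details

Problem definition: The paper addresses the shortcomings of existing frameworks for evaluating the cybersecurity capabilities of large language models in operational technology settings, specifically in IEC 61850 digital substation environments. Existing approaches concentrate on information technology environments and do not account for the specialized protocols and constraints of operational technology, which limits the conclusions their results support.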

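Core idea: CritBench is a dedicated evaluation framework built around the IEC 61850 standard, intended to measure how LLMs perform on domain-specific tasks. A domain-specific tool scaffold strengthens the models on dynamic tasks and makes them more effective in practical use.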

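Technical framework: CritBench consists of several main modules. The first is a task definition module covering static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. The second is a model evaluation module that measures the performance of each model. The third is a tool scaffold module that provides the ability to interact with industrial protocols.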
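
The three modules can be pictured as a small evaluation harness. The following is a minimal sketch under stated assumptions, not CritBench's actual code: the `Task`, `Category`, and `Result` types, the exact-match grading rule, and the `echo_agent` stand-in are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Category(Enum):
    # The three task families named in the summary.
    STATIC_CONFIG = "static configuration analysis"
    NETWORK_RECON = "network traffic reconnaissance"
    LIVE_VM = "live virtual machine interaction"


@dataclass(frozen=True)
class Task:
    task_id: str
    category: Category
    prompt: str
    expected: str  # hypothetical: graded by normalized exact match


@dataclass(frozen=True)
class Result:
    task_id: str
    category: Category
    passed: bool


# An agent is modeled as any callable mapping a prompt to an answer string.
Agent = Callable[[str], str]


def grade(task: Task, answer: str) -> bool:
    """Assumed grading rule: case-insensitive exact match."""
    return answer.strip().lower() == task.expected.strip().lower()


def run_benchmark(agent: Agent, tasks: List[Task]) -> List[Result]:
    """Run every task once and record pass/fail per task."""
    return [Result(t.task_id, t.category, grade(t, agent(t.prompt))) for t in tasks]


def echo_agent(prompt: str) -> str:
    """Stand-in for an LLM agent; always gives the same answer."""
    return "IED1"


tasks = [
    Task("t1", Category.STATIC_CONFIG, "Name the first IED in the file.", "ied1"),
    Task("t2", Category.LIVE_VM, "Report the breaker position.", "open"),
]
results = run_benchmark(echo_agent, tasks)
```

Modeling the agent as a plain callable keeps the harness independent of any particular model API; a real evaluation would substitute an LLM-backed function and a task-appropriate grader.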

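Key innovation: CritBench's main contribution is an evaluation tool and framework designed specifically for the IEC 61850 standard. It fills the gap left by existing evaluation methods in the operational technology domain and, compared with them, is better adapted to the requirements of this specific domain.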

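Key design: CritBench uses a domain-specific tool scaffold to improve model performance on dynamic tasks. Specific parameter settings and the choice of loss function are not disclosed in detail; future work could examine these technical details further.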
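
To make the idea of a domain-specific tool concrete, here is a hedged sketch of one tool an agent might be given for static configuration analysis. It is not taken from CritBench: the function `list_logical_nodes`, the `TOOLS` registry, and the sample document are hypothetical. The sketch relies only on the SCL XML namespace and the IED / LDevice / LN element layout used by IEC 61850 configuration files.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Namespace used by IEC 61850 SCL configuration files.
SCL_NS = {"scl": "http://www.iec.ch/61850/2003/SCL"}

# Tiny made-up SCL fragment, for illustration only.
SAMPLE_SCL = """<SCL xmlns="http://www.iec.ch/61850/2003/SCL">
  <IED name="IED1">
    <AccessPoint name="AP1">
      <Server>
        <LDevice inst="LD0">
          <LN0 lnClass="LLN0" inst="" lnType="LLN0_T"/>
          <LN lnClass="XCBR" inst="1" lnType="XCBR_T"/>
        </LDevice>
      </Server>
    </AccessPoint>
  </IED>
</SCL>"""


def list_logical_nodes(scl_text: str) -> List[str]:
    """Return 'IED/LDevice/LNclass+inst' for every LN element (LN0 excluded)."""
    root = ET.fromstring(scl_text)
    found = []
    for ied in root.findall("scl:IED", SCL_NS):
        for ldev in ied.iterfind(".//scl:LDevice", SCL_NS):
            for ln in ldev.findall("scl:LN", SCL_NS):
                found.append(
                    f"{ied.get('name')}/{ldev.get('inst')}/"
                    f"{ln.get('lnClass')}{ln.get('inst')}"
                )
    return found


# Hypothetical registry: tool name -> callable the agent may invoke.
TOOLS: Dict[str, Callable[[str], List[str]]] = {
    "list_logical_nodes": list_logical_nodes,
}

nodes = TOOLS["list_logical_nodes"](SAMPLE_SCL)
```

A tool like this hands the agent a compact, structured answer, so it does not have to track the XML hierarchy itself across several reasoning steps.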

📊 Experimental Highlights

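The experiments show that LLM agents perform reliably on static structured-file analysis and single-tool network enumeration, with success rates above 80%. On dynamic tasks, however, performance drops, exposing the models' weaknesses in persistent sequential reasoning and state tracking. With the domain-specific tool scaffold, performance on dynamic tasks improves significantly and the operational bottleneck is largely mitigated.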
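
The comparison described above amounts to grouping task outcomes by category and by whether the scaffold was available. The sketch below shows that arithmetic; the record layout and the `success_rates` helper are assumptions, and the records are synthetic placeholders rather than results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (category, used_scaffold, passed): synthetic placeholder records, not paper data.
Record = Tuple[str, bool, bool]

records = [
    ("static", False, True),
    ("static", False, True),
    ("dynamic", False, False),
    ("dynamic", False, True),
    ("dynamic", True, True),
    ("dynamic", True, True),
]


def success_rates(rows: Iterable[Record]) -> Dict[Tuple[str, bool], float]:
    """Fraction of passed tasks for each (category, used_scaffold) group."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, scaffold, ok in rows:
        key = (category, scaffold)
        total[key] += 1
        passed[key] += int(ok)
    return {key: passed[key] / total[key] for key in total}


rates = success_rates(records)
```

Keying on the pair (category, scaffold) makes the effect of the scaffold readable as a difference between two entries for the same category.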

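🎯 Application Scenarios

Potential application areas for CritBench include digital substations in the power sector, smart grids, and other operational technology environments that demand high security and reliability. Evaluating the cybersecurity capabilities of large language models in these settings gives practitioners a more reliable basis for deployment decisions and can improve system security and stability. In the future, the framework may influence standards development and model development in cybersecurity.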

📄 Abstract (Original)

The advancement of Large Language Models (LLMs) has raised concerns regarding their dual-use potential in cybersecurity. Existing evaluation frameworks overwhelmingly focus on Information Technology (IT) environments, failing to capture the constraints and specialized protocols of Operational Technology (OT). To address this gap, we introduce CritBench, a novel framework designed to evaluate the cybersecurity capabilities of LLM agents within IEC 61850 Digital Substation environments. We assess five state-of-the-art models, including OpenAI's GPT-5 suite and open-weight models, across a corpus of 81 domain-specific tasks spanning static configuration analysis, network traffic reconnaissance, and live virtual machine interaction. To facilitate industrial protocol interaction, we develop a domain-specific tool scaffold. Our empirical results show that agents reliably execute static structured-file analysis and single-tool network enumeration, but their performance degrades on dynamic tasks. Despite demonstrating explicit, internalized knowledge of the IEC 61850 standards terminology, current models struggle with the persistent sequential reasoning and state tracking required to manipulate live systems without specialized tools. Equipping agents with our domain-specific tool scaffold significantly mitigates this operational bottleneck. Code and evaluation scripts are available at: https://github.com/GKeppler/CritBench
