Red Lines and Grey Zones in the Fog of War: Benchmarking Legal Risk, Moral Harm, and Regional Bias in Large Language Model Military Decision-Making

作者: Toby Drinkall

分类: cs.CY, cs.AI, cs.CL

发布日期: 2025-10-03

备注: 54 pages; 11 figures

💡 一句话要点

提出基准框架以评估军事决策中的法律风险与道德伤害

🎯 匹配领域: 支柱九：具身大模型 (Embodied Foundation Models)

关键词: 大型语言模型 军事决策 法律风险 道德伤害 国际人道法 基准框架 模拟冲突 目标选择

📋 核心要点

现有的军事决策支持系统在整合大型语言模型时面临法律和道德风险评估的挑战。
本研究提出了一个基准框架，通过四个指标评估LLMs在模拟冲突中的目标行为，确保符合国际人道法。
实验结果显示，所评估的模型在模拟冲突中表现出显著的目标选择差异，违反法律原则的情况普遍存在。

📝 摘要（中文）

随着军事组织考虑将大型语言模型（LLMs）整合到指挥与控制系统中，理解其行为倾向至关重要。本研究开发了一个基准框架，用于评估多轮模拟冲突中LLMs作为代理的目标行为的法律和道德风险。我们引入了四个基于国际人道法（IHL）和军事原则的指标：平民目标率（CTR）和双重用途目标率（DTR）评估法律合规性，而模拟非战斗人员伤亡值（SNCV）的均值和最大值则量化对平民伤害的容忍度。通过对三种前沿模型（GPT-4o、Gemini-2.5和LLaMA-3.1）进行90次多代理、多轮危机模拟的评估，发现现成的LLMs在模拟冲突环境中表现出令人担忧和不可预测的目标行为。所有模型均违反了IHL的区分原则，平民目标的选择率在16.7%到66.7%之间。此项工作旨在提供LLMs在决策支持系统中可能出现的行为风险的概念验证，并建立可重复的基准框架以标准化预部署测试。

🔬 方法详解

问题定义：本研究旨在解决大型语言模型在军事决策中可能带来的法律和道德风险，现有方法缺乏系统性评估这些风险的框架。

核心思路：通过建立一个基准框架，结合国际人道法的原则，评估LLMs在多轮模拟冲突中的目标选择行为，以量化其法律合规性和对平民伤害的容忍度。

技术框架：研究设计了四个评估指标，包括平民目标率（CTR）、双重用途目标率（DTR）、模拟非战斗人员伤亡值的均值和最大值，构成了评估的核心模块。

关键创新：本研究的创新在于提出了一个系统化的评估框架，能够量化LLMs的目标选择行为与国际人道法的合规性，填补了现有研究的空白。

关键设计：在实验中，使用了三种不同的前沿模型，并通过90次多轮模拟进行评估，确保了结果的可靠性和可重复性。

🖼️ 关键图片

📊 实验亮点

实验结果显示，所有模型在模拟冲突中均违反了国际人道法的区分原则，平民目标选择率在16.7%至66.7%之间，且模型间的选择差异显著，LLaMA-3.1的平均平民打击次数为3.47，而Gemini-2.5为0.90，表明模型选择直接影响法律与道德风险的接受程度。

🎯 应用场景

该研究的潜在应用领域包括军事决策支持系统的开发与评估，能够帮助军事组织在部署大型语言模型时更好地理解和管理法律与道德风险，提升决策的合规性与人道性。

📄 摘要（原文）

As military organisations consider integrating large language models (LLMs) into command and control (C2) systems for planning and decision support, understanding their behavioural tendencies is critical. This study develops a benchmarking framework for evaluating aspects of legal and moral risk in targeting behaviour by comparing LLMs acting as agents in multi-turn simulated conflict. We introduce four metrics grounded in International Humanitarian Law (IHL) and military doctrine: Civilian Target Rate (CTR) and Dual-use Target Rate (DTR) assess compliance with legal targeting principles, while Mean and Max Simulated Non-combatant Casualty Value (SNCV) quantify tolerance for civilian harm. We evaluate three frontier models, GPT-4o, Gemini-2.5, and LLaMA-3.1, through 90 multi-agent, multi-turn crisis simulations across three geographic regions. Our findings reveal that off-the-shelf LLMs exhibit concerning and unpredictable targeting behaviour in simulated conflict environments. All models violated the IHL principle of distinction by targeting civilian objects, with breach rates ranging from 16.7% to 66.7%. Harm tolerance escalated through crisis simulations with MeanSNCV increasing from 16.5 in early turns to 27.7 in late turns. Significant inter-model variation emerged: LLaMA-3.1 selected an average of 3.47 civilian strikes per simulation with MeanSNCV of 28.4, while Gemini-2.5 selected 0.90 civilian strikes with MeanSNCV of 17.6. These differences indicate that model selection for deployment constitutes a choice about acceptable legal and moral risk profiles in military operations. This work seeks to provide a proof-of-concept of potential behavioural risks that could emerge from the use of LLMs in Decision Support Systems (AI DSS) as well as a reproducible benchmarking framework with interpretable metrics for standardising pre-deployment testing.

Red Lines and Grey Zones in the Fog of War: Benchmarking Legal Risk, Moral Harm, and Regional Bias in Large Language Model Military Decision-Making

💡 一句话要点

📋 核心要点

📝 摘要（中文）

🔬 方法详解

🖼️ 关键图片

📊 实验亮点

🎯 应用场景

📄 摘要（原文）

⭐ 我的收藏

📁 新建收藏夹

⚙️ 管理收藏夹

🔍 搜索论文

🔐 登录 / 注册

👤 用户管理