Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

Author: Terry Leitch

Categories: cs.AI, cs.HC, cs.LG

Published: 2026-04-20

💡 One-sentence takeaway

A systematic evaluation of cloud and local LLMs as System Dynamics AI assistants.

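🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: system dynamics, large language models, cloud computing, local deployment, causal loop diagrams, model discussion, performance evaluation, quantization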

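📋 Key points

LLMs used as System Dynamics AI assistants differ widely in performance, and error fixing under long-context prompts is a particular weak spot.
The paper systematically evaluates different types of LLM, comparing cloud and local models on purpose-built tasks to show how model type affects performance.
Cloud models reach 77-89% on causal loop diagram (CLD) extraction while the best local model reaches 77%; on model discussion, local models do well on model building steps but fall well short on error fixing.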

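📝 Abstract (summary)

This paper systematically evaluates large language models (LLMs) as System Dynamics AI assistants, covering both cloud APIs and locally hosted open-source models. The study uses two purpose-built benchmarks: the CLD Leaderboard (causal loop diagram extraction) and the Discussion Leaderboard (model discussion and feedback). Cloud models achieve overall pass rates of 77-89% on CLD extraction, against 77% for the best local model. On Discussion, local models do well on model building steps but poorly on error fixing, which exposes memory limits in local deployments. The core contribution is a systematic analysis of how model type affects performance, comparing architectures, backends, and quantization levels.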

🔬 Method details

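Problem definition: The paper addresses the performance differences among large language models (LLMs) used as System Dynamics AI assistants for causal loop diagram extraction and model discussion, in particular the weak performance of local models on long-context prompts.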

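Core idea: Systematically evaluate cloud and local models on purpose-built benchmark tasks, and analyze how model type, architecture, and quantization affect performance, in order to inform more effective AI assistant deployments.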

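Technical framework: The study uses two benchmarks: the CLD Leaderboard for causal loop diagram extraction and the Discussion Leaderboard for model discussion. The evaluation covers model architectures (reasoning vs. instruction-tuned), backends (GGUF vs. MLX), and quantization levels (Q3, Q4_K_M, MLX-3bit, MLX-4bit, MLX-6bit).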
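To make "structured causal loop diagram extraction" concrete, here is a minimal sketch of how an extracted CLD might be compared with a reference answer. The `Link` type, the exact-match rule, and the toy population diagram are all assumptions made for illustration; the summary does not describe the CLD Leaderboard's actual schema or scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One causal link in a CLD (hypothetical representation)."""
    source: str
    target: str
    polarity: str  # "+" for same-direction influence, "-" for opposite


def normalize(link: Link) -> Link:
    # Compare variable names case-insensitively and ignore stray whitespace.
    return Link(link.source.strip().lower(), link.target.strip().lower(), link.polarity)


def compare_cld(extracted, reference):
    """Return missing/extra links and an exact-match pass flag (assumed rule)."""
    got = {normalize(link) for link in extracted}
    want = {normalize(link) for link in reference}
    return {"missing": want - got, "extra": got - want, "passed": got == want}


# Toy reference diagram: a classic population model, not taken from the paper.
reference = [
    Link("Births", "Population", "+"),
    Link("Population", "Births", "+"),
    Link("Population", "Deaths", "+"),
    Link("Deaths", "Population", "-"),
]
# A model answer that drops the balancing link from deaths back to population.
extracted = [
    Link("births", "population", "+"),
    Link("population", "births", "+"),
    Link("population", "deaths", "+"),
]

result = compare_cld(extracted, reference)
print(result["passed"], result["missing"])
```

A real benchmark would likely need fuzzier matching of variable names; exact set equality is used here only to keep the sketch short.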

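Key innovation: The paper systematically analyzes how model type affects performance and finds that backend choice has a larger practical impact than quantization level. The backends differ markedly in how they handle JSON: mlx_lm does not enforce JSON schema constraints and needs explicit prompt-level JSON instructions, whereas llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models.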
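When a backend does not enforce a JSON schema, the caller has to ask for JSON in the prompt and then check the reply itself. The sketch below shows one way that could look using only the standard library. The instruction wording, the `links` key, and both helper names are hypothetical, and no backend API is called here.

```python
import json

# Hypothetical instruction text appended to every task prompt.
JSON_INSTRUCTION = (
    "Respond with a single JSON object and nothing else. "
    'The object must contain a "links" array.'
)


def build_prompt(task_text: str) -> str:
    """Append the explicit JSON instruction to the task description."""
    return f"{task_text}\n\n{JSON_INSTRUCTION}"


def parse_json_reply(reply: str, required_keys=("links",)):
    """Extract and validate a JSON object from a free-form reply.

    Returns the parsed dict, or None when no valid object with the
    required keys can be recovered.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or any(key not in data for key in required_keys):
        return None
    return data


chatty_reply = 'Sure, here it is: {"links": [{"from": "a", "to": "b", "polarity": "+"}]} Hope that helps.'
print(parse_json_reply(chatty_reply))
print(parse_json_reply("I could not produce a diagram."))
```

Treating an unparseable reply as `None` lets a harness count it as a failed test rather than crash, which is one plausible way to keep runs comparable across backends.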

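Key design: The study documents the full parameter sweep (t, p, k) for all local models, reports cleaned timing data with stuck requests excluded, and provides a practitioner guide for running 671B-123B parameter models on Apple Silicon. The local configurations also vary quantization level and model architecture.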
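A parameter sweep and the timing cleanup can be sketched as follows. This assumes that t, p, and k stand for sampling temperature, top-p, and top-k, which the summary does not spell out; the grid values, the record fields, and the timeout threshold are placeholders, not the paper's settings or data.

```python
from itertools import product
from statistics import mean

# Placeholder grids; the paper's actual sweep values are not given in the summary.
TEMPERATURES = [0.0, 0.7]  # t (assumed: sampling temperature)
TOP_PS = [0.9, 1.0]        # p (assumed: top-p)
TOP_KS = [20, 40]          # k (assumed: top-k)


def sweep_configs():
    """Enumerate every (t, p, k) combination in the grid."""
    return [
        {"t": t, "p": p, "k": k}
        for t, p, k in product(TEMPERATURES, TOP_PS, TOP_KS)
    ]


def clean_timings(records, timeout_s=600.0):
    """Drop stuck requests: unfinished ones or ones that hit the timeout."""
    return [r for r in records if r["finished"] and r["seconds"] < timeout_s]


configs = sweep_configs()

# Made-up timing records for illustration only.
records = [
    {"seconds": 12.0, "finished": True},
    {"seconds": 18.0, "finished": True},
    {"seconds": 600.0, "finished": False},  # stuck request, excluded below
]
cleaned = clean_timings(records)
mean_seconds = mean(r["seconds"] for r in cleaned)
print(len(configs), len(cleaned), mean_seconds)
```

Excluding stuck requests before averaging keeps a single runaway generation from dominating the reported latency, at the cost of hiding how often requests get stuck, so the exclusion count is worth reporting alongside the mean.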

📊 Experimental highlights

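Cloud models achieve an overall pass rate of 77-89% on the causal loop diagram extraction task, against 77% for the best local model. On model discussion, the best local models score 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that are hard for local deployments.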
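Per-category figures like these are pass rates: the share of tests in a category that a model passes. A minimal sketch of that aggregation follows. The category names mirror the Discussion categories above, but the individual outcomes are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute the percentage of passed tests per category.

    `results` is an iterable of (category, passed) pairs.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: 100.0 * passes[category] / totals[category] for category in totals}


# Invented outcomes, for illustration only.
toy_results = [
    ("model_building", True),
    ("model_building", True),
    ("feedback_explanation", True),
    ("feedback_explanation", False),
    ("feedback_explanation", True),
    ("feedback_explanation", True),
    ("error_fixing", False),
    ("error_fixing", True),
]

rates = pass_rates(toy_results)
for category, rate in rates.items():
    print(f"{category}: {rate:.0f}%")
```

With only a handful of tests per category, a single failure moves the rate by a large step, which is worth keeping in mind when reading wide ranges such as 0-50%.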

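🎯 Application scenarios

Potential applications include system dynamics modeling, education and training, and decision support systems. Better-performing AI assistants can make users more efficient and accurate when analyzing complex systems, and support wider use of AI assistance in these fields.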

📄 Abstract (original)

We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally-hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of model type effects on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep (t, p, k) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B-123B parameter models on Apple Silicon.
