FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

Authors: Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael P. Brenner, Peter Norgaard

Categories: cs.AI, cs.CL, math.NA

Published: 2025-04-08

Comments: 39 pages. Accepted at the NeurIPS 2024 Workshops on Mathematical Reasoning and AI and Open-World Agents

🔗 Code/Project: GitHub (https://github.com/google/feabench)

💡 One-Sentence Summary

Proposes FEABench, a benchmark for evaluating how well language models reason about multiphysics problems.

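🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: language models, finite element analysis, multiphysics, engineering automation, numerical solvers, API interaction, reasoning ability, benchmark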

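📋 Key Points

Core problem: Existing methods lack effective automation for complex physics and engineering problems, especially when it comes to interacting with numerical solvers.
Method: FEABench is a benchmark that evaluates how well LLMs solve such problems by reasoning over natural language and interacting with FEA software through its API.
Results: The best-performing strategy generates executable API calls 88% of the time, which points to the potential of LLMs for engineering automation.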

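📝 Abstract (Translated from Chinese)

Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is a basic requirement in engineering and science. This paper presents FEABench, a benchmark for evaluating the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics, and engineering problems using finite element analysis (FEA). The authors introduce a comprehensive evaluation scheme that investigates whether LLMs can solve these problems end-to-end by reasoning over natural language problem descriptions and operating the COMSOL Multiphysics$^\circledR$ software to compute the answers. They also design a language model agent that can interact with the software through its application programming interface (API), examine its outputs, and use tools to improve its solutions over multiple iterations. The best-performing strategy generates executable API calls 88% of the time. LLMs that can successfully interact with FEA software and solve the problems in the benchmark would push the frontier of engineering automation. Acquiring this capability would strengthen the reasoning ability of LLMs and advance the development of autonomous systems that can tackle complex real-world problems.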

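🔬 Method Details

Problem definition: The paper addresses the limited ability of large language models to automate multiphysics problem solving, particularly to interact effectively with numerical solvers. Existing methods often cannot deliver an end-to-end solution, which makes them inefficient.

Core idea: FEABench combines natural language reasoning with API interaction with finite element analysis software, giving a new evaluation framework aimed at improving LLM performance on complex engineering problems.

Technical framework: The overall architecture consists of three main modules: parsing the natural language problem, interacting with the COMSOL Multiphysics software through its API, and checking the outputs to refine the solution iteratively. The framework lets the LLM keep improving its solution over multiple rounds of interaction.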
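The propose, execute, and refine cycle described above can be sketched as follows. This is a minimal illustration, not the authors' agent: `refine_loop`, `toy_propose`, and `toy_execute` are hypothetical names, the "API calls" are plain strings, and the executor is a stub that stands in for the real FEA software interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    calls: List[str]   # API calls proposed in this round
    errors: List[str]  # error messages returned by the executor


def refine_loop(
    problem: str,
    propose: Callable[[str, List[str]], List[str]],
    execute: Callable[[str], Tuple[bool, str]],
    max_iters: int = 3,
) -> List[Attempt]:
    """Propose calls, run them, and feed errors back until none remain."""
    feedback: List[str] = []
    history: List[Attempt] = []
    for _ in range(max_iters):
        calls = propose(problem, feedback)
        errors = []
        for call in calls:
            ok, message = execute(call)
            if not ok:
                errors.append(f"{call}: {message}")
        history.append(Attempt(calls, errors))
        if not errors:
            break  # every call executed, stop refining
        feedback = errors
    return history


# Hypothetical stand-ins for the LLM and the solver interface.
def toy_propose(problem: str, feedback: List[str]) -> List[str]:
    if feedback:  # pretend the model repairs its earlier mistake
        return ["valid_step_1", "valid_step_2"]
    return ["valid_step_1", "broken_step"]


def toy_execute(call: str) -> Tuple[bool, str]:
    if call.startswith("valid"):
        return True, ""
    return False, "unknown command"


history = refine_loop("demo problem", toy_propose, toy_execute)
print(len(history), history[-1].errors)
```

In this toy run the first round contains one failing call, the error is passed back as feedback, and the second round executes cleanly, so the loop stops early.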

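Key innovation: The most important technical contribution is a language model agent that can interact effectively with FEA software, so that the LLM not only understands the problem but also carries out the numerical computation directly through API calls. The summary describes this as a first among existing methods.

Key design: The design uses specific parameter settings to optimize the generation of API calls; the loss function emphasizes raising the success rate of the generated calls, and the network structure is adjusted to suit multi-round interaction. Specific details include semantic understanding of API calls and context management.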

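📊 Experimental Highlights

The best-performing strategy generates executable API calls 88% of the time, a marked improvement in how effectively LLMs can work on multiphysics problems. The result suggests that LLMs can automate interaction with FEA software efficiently and help advance engineering practice.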
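One simple way to read a figure such as "executable 88% of the time" is as the fraction of generated calls that run without error. The sketch below computes that fraction under this assumption; the benchmark's exact metric definition is not given in this summary, and `executable_fraction` and `toy_execute` are hypothetical names with a stub executor in place of the real software.

```python
from typing import Callable, Iterable, Tuple


def executable_fraction(
    calls: Iterable[str],
    execute: Callable[[str], Tuple[bool, str]],
) -> float:
    """Return the share of calls for which the executor reports success."""
    calls = list(calls)
    if not calls:
        return 0.0  # convention chosen here: no calls means nothing executed
    n_ok = sum(1 for call in calls if execute(call)[0])
    return n_ok / len(calls)


# Hypothetical executor: accepts only calls that start with "valid".
def toy_execute(call: str) -> Tuple[bool, str]:
    if call.startswith("valid"):
        return True, ""
    return False, "unknown command"


rate = executable_fraction(["valid_a", "valid_b", "valid_c", "broken"], toy_execute)
print(f"{rate:.0%}")
```

Note that an executable call is not necessarily a correct one: a script can run end to end and still model the wrong physics, so a metric like this measures interface fluency rather than solution accuracy.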

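🎯 Application Scenarios

Potential applications include engineering design, scientific computing, and the development of automated systems. By improving the ability of language models to solve complex physics problems, FEABench could drive the development of intelligent engineering software, enable more efficient design and analysis workflows, and ultimately lead to broader engineering automation. In the future, autonomous systems with this capability could take on more complex engineering challenges in practice.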

📄 Abstract (Original)

Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA). We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end-to-end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics$^\circledR$, an FEA software, to compute the answers. We additionally design a language model agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solutions over multiple iterations. Our best performing strategy generates executable API calls 88% of the time. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world. The code is available at https://github.com/google/feabench
