Evaluating Hydro-Science and Engineering Knowledge of Large Language Models
Authors: Shiruo Hu, Wenbo Shan, Yingjia Li, Zhiqi Wan, Xinpeng Yu, Yunjia Qi, Haotian Xia, Yang Xiao, Dingxiao Liu, Jiaru Wang, Chenxu Gong, Ruixi Zhang, Shuyue Wu, Shibo Cui, Chee Hui Lai, Wei Luo, Yubin He, Bin Xu, Jianshi Zhao
Category: cs.CL
Published: 2025-12-03
Note: Hydro-SE Bench sets a new benchmark for the evaluation of LLMs in the Hydro-Science and Engineering domain, with its code and data available at \url{https://github.com/sheishijun/Hydro-SE-Bench}
💡 One-Sentence Takeaway
Proposes Hydro-SE Bench, a benchmark for evaluating the knowledge and application abilities of large language models in the Hydro-Science and Engineering domain
🎯 Matched Area: Pillar 9: Embodied Foundation Models
Keywords: large language models, Hydro-Science and Engineering, evaluation benchmark, Hydro-SE Bench, domain knowledge
📋 Key Points
- The knowledge and application abilities of existing large language models in Hydro-Science and Engineering have not been sufficiently evaluated, which hinders their effective use in the domain.
- The paper proposes Hydro-SE Bench, an evaluation benchmark of 4,000 multiple-choice questions covering nine subfields, for comprehensively assessing LLMs' knowledge and abilities in Hydro-Science and Engineering.
- Experiments show that commercial LLMs outperform small-parameter LLMs on Hydro-SE Bench but still fall short on domain-specific knowledge; scaling up model size mainly improves reasoning and calculation abilities.
🔬 Method Details
Problem definition: The paper addresses the insufficient evaluation of large language models' (LLMs') knowledge and application abilities in Hydro-Science and Engineering (Hydro-SE). No dedicated evaluation benchmark exists for the Hydro-SE domain, so LLM performance there cannot be measured accurately, hindering the effective application and further development of LLMs in the field. Existing evaluations also cannot distinguish how LLM performance differs across subfields, leaving model developers without clear training targets.
Core idea: The core idea is to build a comprehensive Hydro-SE LLM evaluation benchmark (Hydro-SE Bench) consisting of multiple-choice questions across several subfields, assessing LLMs on basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. Evaluating LLMs on Hydro-SE Bench reveals their strengths and weaknesses in the domain, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.
Technical framework: Hydro-SE Bench contains 4,000 multiple-choice questions covering nine subfields: hydrology, hydraulics, water resources engineering, water environment engineering, hydro-junction engineering, river dynamics, coastal engineering, port engineering, and hydroinformatics. The evaluation proceeds in four steps: 1) construct the Hydro-SE Bench dataset; 2) select the LLMs to be evaluated; 3) feed the Hydro-SE Bench questions to the LLMs; 4) score the LLMs' answer accuracy.
Key innovation: The main contribution is Hydro-SE Bench, a benchmark dedicated to LLM evaluation in Hydro-Science and Engineering. Compared with general-purpose LLM benchmarks, it is domain-targeted and therefore measures LLMs' Hydro-SE knowledge and application abilities more accurately.
Key design: The 4,000 multiple-choice questions cover the nine Hydro-SE subfields and assess basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. Questions are designed across multiple difficulty levels to probe the full range of model capability. The primary evaluation metric is accuracy, i.e., the fraction of questions a model answers correctly.
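As a rough illustration of steps 3) and 4) of the framework above, the sketch below scores multiple-choice accuracy. The question format, `extract_choice`, and `ask_model` are hypothetical stand-ins for illustration, not the benchmark's actual data format or API; see the repository linked in the note for the real implementation.

```python
# Minimal sketch: query a model on each question, parse out an option letter,
# and score exact-match accuracy. All names here are illustrative assumptions.
import re

def extract_choice(response: str, options: str = "ABCD") -> str:
    """Return the first standalone option letter (A-D) found in a model response."""
    m = re.search(rf"\b([{options}])\b", response.upper())
    return m.group(1) if m else ""

def evaluate(questions, ask_model) -> float:
    """questions: dicts with 'prompt' and gold 'answer'; ask_model: prompt -> text."""
    correct = sum(
        extract_choice(ask_model(q["prompt"])) == q["answer"] for q in questions
    )
    return correct / len(questions)

# Dummy model that always answers "A": one of the two questions is correct.
qs = [{"prompt": "Q1?", "answer": "A"}, {"prompt": "Q2?", "answer": "B"}]
print(evaluate(qs, lambda p: "A"))  # 0.5
```

In practice the answer-parsing step is the fragile part: models often reply with full sentences, so a word-boundary match (rather than taking the first letter seen) avoids misreading words like "Answer" as option A.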
📊 Experimental Highlights
Commercial LLMs reach accuracies of 0.74 to 0.80 on Hydro-SE Bench, while small-parameter LLMs range from 0.41 to 0.68. LLMs perform well in subfields closely tied to the natural and physical sciences, but fall short on domain-specific knowledge such as industry standards and hydraulic structures. Scaling up model size mainly improves reasoning and calculation abilities.
🎯 Application Scenarios
The results can be used to assess and improve LLMs for Hydro-Science and Engineering applications, such as assisting water resources management, optimizing hydraulic engineering design, and forecasting flood and drought disasters. With Hydro-SE Bench, practitioners can select LLMs better suited to Hydro-SE tasks, and model developers can target training accordingly, improving the effectiveness of LLMs in the domain and supporting its intelligent development.
📄 Abstract (Original)
Hydro-Science and Engineering (Hydro-SE) is a critical and irreplaceable domain that secures human water supply, generates clean hydropower energy, and mitigates flood and drought disasters. Featuring multiple engineering objectives, Hydro-SE is an inherently interdisciplinary domain that integrates scientific knowledge with engineering expertise. This integration necessitates extensive expert collaboration in decision-making, which poses difficulties for intelligence. With the rapid advancement of large language models (LLMs), their potential application in the Hydro-SE domain is being increasingly explored. However, the knowledge and application abilities of LLMs in Hydro-SE have not been sufficiently evaluated. To address this issue, we propose the Hydro-SE LLM evaluation benchmark (Hydro-SE Bench), which contains 4,000 multiple-choice questions. Hydro-SE Bench covers nine subfields and enables evaluation of LLMs in aspects of basic conceptual knowledge, engineering application ability, and reasoning and calculation ability. The evaluation results on Hydro-SE Bench show that the accuracy values vary among 0.74 to 0.80 for commercial LLMs, and among 0.41 to 0.68 for small-parameter LLMs. While LLMs perform well in subfields closely related to natural and physical sciences, they struggle with domain-specific knowledge such as industry standards and hydraulic structures. Model scaling mainly improves reasoning and calculation abilities, but there is still great potential for LLMs to better handle problems in practical engineering application. This study highlights the strengths and weaknesses of LLMs for Hydro-SE tasks, providing model developers with clear training targets and Hydro-SE researchers with practical guidance for applying LLMs.