The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations

Authors: Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego

Categories: cs.AI, cs.CL

Published: 2025-07-17

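💡 One-Sentence Summary

Presents GEA, a platform that incorporates energy consumption into the human evaluation of large language models.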

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: LLM evaluation, human evaluation, energy awareness, energy efficiency, GEA platform

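📋 Key Points

Existing LLM evaluation methods such as automated benchmarks correlate poorly with human judgments, while traditional human evaluation is costly and hard to scale.
GEA shows users each model's energy consumption and studies how that awareness affects which model they choose, with the aim of informing model selection.
Preliminary results indicate that users who are aware of energy consumption tend to choose the lower-energy model, suggesting that the extra overhead of more complex models may not be worthwhile.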

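📝 Abstract (Translated)

Evaluating large language models is a complex task for which several approaches have been proposed. The most common is automated benchmarking, in which LLMs answer multiple-choice questions on different topics. This method has limitations, the most concerning being its poor correlation with human judgments. An alternative is to have humans evaluate the LLMs. That raises scalability issues: the number of models to evaluate is large and growing, so traditional studies that recruit evaluators and have them rank model responses become impractical (and costly). A further alternative is a public arena, such as the popular LM arena, where any user can freely evaluate models on any question and rank the responses of two models; the results are then compiled into a model ranking. Energy consumption is an increasingly important aspect of LLMs, so it is of interest to evaluate how energy awareness influences human decisions when selecting a model. This paper presents GEA, the Generative Energy Arena, an arena that incorporates information on model energy consumption into the evaluation process. It also presents preliminary results obtained with GEA, which show that for most questions, users who are aware of energy consumption favor smaller, more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by more complex, top-performing models do not improve the perceived quality of the responses enough to justify their use.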

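🔬 Method Details

Problem definition: Existing LLM evaluation methods fall short. Automated benchmarks are efficient but diverge considerably from human perception. Traditional human evaluation, such as recruiting evaluators to rank responses, is costly and hard to scale, so it cannot keep pace with the rapidly growing number of models. In addition, these methods ignore energy consumption and therefore do nothing to steer users toward more energy-efficient models.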

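Core idea: The paper builds an online platform called GEA (Generative Energy Arena) that lets users see a model's energy consumption while they evaluate it. By observing which models users prefer once they know the energy cost, the authors study how energy awareness affects model selection, with the goal of encouraging the development and use of more energy-efficient models.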

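Technical framework: At its core, GEA is an online arena in which users compare the answers of two large language models to a given question. Unlike the traditional LM arena, GEA shows energy consumption information alongside the model answers. Users pick the better model based on both answer quality and energy information. The platform collects these choices and analyzes them statistically to assess how energy awareness affects model selection.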
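The collection-and-analysis step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Vote` record, its field names, and the `lower_energy_preference` function are hypothetical, and the summary does not say which statistic GEA actually computes. The sketch simply reports the share of decisive votes that went to the lower-energy model in each pair.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Vote:
    """One pairwise comparison (hypothetical schema, not from the paper)."""
    model_a: str
    model_b: str
    energy_wh_a: float  # assumed per-response energy estimate for model A
    energy_wh_b: float  # assumed per-response energy estimate for model B
    choice: str         # "a", "b", or "tie"


def lower_energy_preference(votes: Iterable[Vote]) -> Optional[float]:
    """Share of decisive votes won by the lower-energy model.

    Ties and pairs with equal energy are skipped. Returns None when
    no decisive vote remains.
    """
    counts = Counter()
    for v in votes:
        if v.choice == "tie" or v.energy_wh_a == v.energy_wh_b:
            continue
        efficient = "a" if v.energy_wh_a < v.energy_wh_b else "b"
        counts["efficient" if v.choice == efficient else "costly"] += 1
    decisive = counts["efficient"] + counts["costly"]
    return counts["efficient"] / decisive if decisive else None


# Made-up example data, for illustration only.
example_votes = [
    Vote("small", "large", 0.5, 3.0, "a"),
    Vote("large", "small", 3.0, 0.5, "b"),
    Vote("small", "large", 0.5, 3.0, "b"),
    Vote("small", "large", 0.5, 3.0, "tie"),
]
share = lower_energy_preference(example_votes)
```

A share well above 0.5 on real data would be consistent with the preference for efficient models that the paper reports, though a proper analysis would also need a comparison against votes cast without energy information.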

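Key innovation: GEA brings energy consumption information into the human evaluation workflow for large language models. By supplying energy data, it encourages users to weigh energy efficiency when choosing a model, which in turn supports more sustainable LLM development. It is an important complement to traditional evaluation methods and helps bridge the gap between automated evaluation and human perception.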

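Key design: The platform's design focuses on user experience and data collection. The interface is simple and intuitive so that users can easily compare model answers and energy information. Energy information needs to be presented clearly, for example as a relative energy metric or an energy grade. The platform needs to record user choices and gather user feedback for in-depth analysis. GEA could also integrate different energy measurement tools to keep the energy data accurate.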
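As one way to picture the "relative energy metric or energy grade" mentioned above, the sketch below normalizes a model's energy against a baseline and maps the ratio to a letter grade. Everything here is an assumption made for illustration: the function names, the choice of baseline, and the thresholds are hypothetical and do not come from the paper.

```python
def relative_energy(energy_wh: float, baseline_wh: float) -> float:
    """Energy of a response relative to a baseline model (1.0 = equal)."""
    if baseline_wh <= 0:
        raise ValueError("baseline_wh must be positive")
    return energy_wh / baseline_wh


# Hypothetical grade boundaries: (upper bound on ratio, grade).
DEFAULT_THRESHOLDS = ((0.5, "A"), (1.0, "B"), (2.0, "C"))


def energy_grade(ratio: float, thresholds=DEFAULT_THRESHOLDS) -> str:
    """Map a relative-energy ratio to a coarse letter grade."""
    for upper, grade in thresholds:
        if ratio <= upper:
            return grade
    return "D"  # anything above the last threshold


# Example: a response that used half the baseline's energy.
ratio = relative_energy(1.5, 3.0)
grade = energy_grade(ratio)
```

A coarse grade hides measurement noise and is easy to read at a glance, while the raw ratio keeps more information; which of the two suits users better is a design question the summary leaves open.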

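📊 Experimental Highlights

Preliminary results from GEA show that users who know a model's energy consumption tend to choose the lower-energy model. This suggests that in many user interactions the extra computational overhead of complex models may not be worthwhile, and that users value the balance between energy use and performance. The finding is a useful reference for optimizing and deploying large language models.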

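🎯 Application Scenarios

GEA can inform the selection and deployment of large language models by helping users trade off performance against energy consumption. Organizations could use GEA data to refine their model selection strategy and reduce operating costs and carbon emissions. The work may also encourage the design and development of energy-efficient LLMs and support the sustainable development of artificial intelligence.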

📄 Abstract (Original)

The evaluation of large language models is a complex task, in which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions of different topics. However, this method has certain limitations, being the most concerning, the poor correlation with the humans. An alternative approach, is to have humans evaluate the LLMs. This poses scalability issues as there is a large and growing number of models to evaluate making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. An alternative approach is the use of public arenas, such as the popular LM arena, on which any user can freely evaluate models on any question and rank the responses of two models. The results are then elaborated into a model ranking. An increasingly important aspect of LLMs is their energy consumption and, therefore, evaluating how energy awareness influences the decisions of humans in selecting a model is of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
