Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant

Authors: Gaole He, Gianluca Demartini, Ujwal Gadiraju

Categories: cs.HC, cs.CL

Published: 2025-02-03

Comments: conditionally accepted to CHI 2025

DOI: 10.1145/3706598.3713218

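💡 One-Sentence Summary

An empirical study of how user involvement affects user trust and team performance when LLM agents serve as daily assistants in a plan-then-execute setup.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, human-AI collaboration, intelligent assistants, user trust, Plan-Then-Execute, daily tasks, agent planning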

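📋 Key Points

There is limited understanding of how LLM agents can provide daily assistance through their planning and sequential decision-making capabilities, which hinders their broader adoption.
The paper adopts a plan-then-execute setup in which the LLM agent plans and executes step by step while the user stays in the loop, to preserve user agency and control over the agent.
The empirical study finds that LLM agents work well when a high-quality plan and the necessary user involvement in execution are available, but users can easily mistrust agents whose plans merely seem plausible.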

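🔬 Method Details

Problem definition: The paper studies how user involvement affects user trust and collaborative team performance when LLM agents act as daily assistants. Prior work offers limited understanding of how such agents can assist through planning and sequential decision making, and user trust in them is hard to calibrate, which can hurt collaboration.

Core idea: The paper adopts a "Plan-Then-Execute" setup that splits each task into a planning stage and an execution stage and lets the user intervene in both. The design is meant to give users more control over, and insight into, the agent, and thereby to support their trust in it.

Technical framework: The overall flow is: (1) the user poses a daily task; (2) the LLM agent generates a detailed step-wise plan for the task; (3) the user reviews and edits the plan; (4) the LLM agent executes the revised plan step by step; (5) the user supervises and intervenes during execution. The whole process runs in a simulation environment, which makes it possible to control variables and collect data.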
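The five-step flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: `Step`, `plan_then_execute`, `review_plan`, and `approve_step` are hypothetical names, the plan is passed in rather than generated by an LLM, and each step's `action` stands in for a tool call in a simulated environment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One plan step; `action` is a stand-in for a simulated tool call (assumption)."""
    description: str
    action: Callable[[], str]


def plan_then_execute(
    plan: List[Step],
    review_plan: Callable[[List[Step]], List[Step]],
    approve_step: Callable[[Step], bool],
) -> List[str]:
    """Run a plan in two phases: user plan review, then supervised execution."""
    # Planning phase: the user may edit, reorder, or drop steps.
    revised = review_plan(plan)

    log: List[str] = []
    # Execution phase: a step runs only if the user approves it.
    for step in revised:
        if approve_step(step):
            log.append(step.action())
        else:
            log.append(f"skipped: {step.description}")
    return log


# Hypothetical usage: the user accepts the plan as-is but blocks the payment step.
example_plan = [
    Step("search flights", lambda: "found 3 flights"),
    Step("pay with card", lambda: "payment sent"),
]
example_log = plan_then_execute(
    example_plan,
    review_plan=lambda p: p,
    approve_step=lambda s: "pay" not in s.description,
)
```

Keeping plan review and step approval as separate callbacks mirrors the two points of user involvement the summary describes, so either one can be switched off independently when comparing conditions.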

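Key innovation: The main contribution is combining the LLM agent's planning ability with active user participation to form a closed human-AI loop. This differs both from fully automated agents and from purely manual operation, and emphasizes the complementary strengths of the human and the agent.

Key design: The study uses a specific LLM (the exact model is not stated here) equipped with external tools, such as a flight booking API and a credit card payment API. The degree of user involvement is controlled through different intervention strategies, for example letting users confirm every step or intervene only when an error occurs. User trust is assessed through questionnaires and behavioral data.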
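The two intervention strategies mentioned above (confirm every step, or step in only on errors) can be contrasted with a small sketch. Everything here is an illustrative assumption rather than the study's code: the `Involvement` enum, the `run_steps` helper, and the use of `RuntimeError` to signal a failed simulated tool call are all hypothetical.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Involvement(Enum):
    CONFIRM_EVERY_STEP = "confirm_every_step"
    ON_ERROR_ONLY = "on_error_only"


def run_steps(
    steps: List[Callable[[], str]],
    policy: Involvement,
    ask_user: Callable[[str], bool],
) -> Tuple[List[str], int]:
    """Execute steps under a policy; return the results and how often the user was asked."""
    results: List[str] = []
    prompts = 0
    for i, step in enumerate(steps):
        if policy is Involvement.CONFIRM_EVERY_STEP:
            # The user must approve each step before it runs.
            prompts += 1
            if not ask_user(f"Run step {i}?"):
                results.append("rejected by user")
                continue
        try:
            results.append(step())
        except RuntimeError as exc:
            # Under both policies, a failure is escalated to the user.
            prompts += 1
            results.append(f"failed: {exc}")
            if not ask_user(f"Step {i} failed ({exc}). Continue?"):
                break
    return results, prompts
```

Counting prompts alongside results gives a rough proxy for user effort, which is one way such a simulation could relate involvement to task outcomes.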

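📊 Experimental Highlights

The results indicate that a high-quality plan combined with active user involvement during execution improves the quality of the tasks the LLM agent completes. The study also finds, however, that users can easily mistrust LLM agents even when their plans seem plausible. This suggests that intelligent assistants should be designed with more attention to calibrating user trust and improving transparency.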

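🎯 Application Scenarios

The findings apply to everyday tasks that call for human-AI collaboration, such as smart home control, schedule management, travel planning, and financial management. A well-designed interaction pattern can increase users' trust in, and willingness to use, an intelligent assistant, which in turn can improve productivity and quality of life. Future work could extend the study to more complex tasks and broader user populations.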

📄 Abstract (original)

Since the explosion in popularity of ChatGPT, large language models (LLMs) have continued to impact our everyday lives. Equipped with external tools that are designed for a specific purpose (e.g., for flight booking or an alarm clock), LLM agents exercise an increasing capability to assist humans in their daily work. Although LLM agents have shown a promising blueprint as daily assistants, there is a limited understanding of how they can provide daily assistance based on planning and sequential decision making capabilities. We draw inspiration from recent work that has highlighted the value of 'LLM-modulo' setups in conjunction with humans-in-the-loop for planning tasks. We conducted an empirical study (N = 248) of LLM agents as daily assistants in six commonly occurring tasks with different levels of risk typically associated with them (e.g., flight ticket booking and credit card payments). To ensure user agency and control over the LLM agent, we adopted LLM agents in a plan-then-execute manner, wherein the agents conducted step-wise planning and step-by-step execution in a simulation environment. We analyzed how user involvement at each stage affects their trust and collaborative team performance. Our findings demonstrate that LLM agents can be a double-edged sword -- (1) they can work well when a high-quality plan and necessary user involvement in execution are available, and (2) users can easily mistrust the LLM agents with plans that seem plausible. We synthesized key insights for using LLM agents as daily assistants to calibrate user trust and achieve better overall task outcomes. Our work has important implications for the future design of daily assistants and human-AI collaboration with LLM agents.
