Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

Authors: Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang

Category: cs.CL

Published: 2026-06-08

💡 One-Sentence Summary

Proposes distribution-level evaluation of how well LLMs simulate human survey responses.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: large language models, human survey simulation, distribution-level evaluation, consumer behavior, market research

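📋 Key Points

Existing work evaluates LLM survey replication mainly through means, which does not capture the diversity and distributional shape of human behavior.
The paper evaluates LLM survey responses at the distributional level against data from a real consumer choice experiment, covering response variables of different statistical types.
LLMs reproduce condition-level patterns reasonably well but fall clearly short on distributional structure, and the input configuration noticeably affects the results.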

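📝 Abstract (Summary)

Large language models (LLMs) are increasingly used to simulate human survey responses, but prior work has evaluated them mainly through mean-level or aggregate agreement and has not examined whether LLMs reproduce the variability of human behavior. This paper evaluates LLM survey replication at the distributional level on a 2010 consumer choice experiment on Korean instant noodle purchases, comparing human and LLM responses in terms of mean-level, pattern, and distributional alignment. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure. For purchase quantity in particular, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Replication also depends on input configuration: structured personas and multimodal inputs improve alignment, while explicit reasoning prompts degrade it.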

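🔬 Method Details

Problem definition: Evaluating LLM-simulated survey responses by means alone is limiting, because it cannot show whether the models capture the distributional characteristics of human behavior.

Core idea: Evaluate LLM survey replication at the distributional level, comparing human and LLM responses on response variables of different statistical types to show how well the models actually perform.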

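Technical framework: The study uses data from a 2010 consumer choice experiment on Korean instant noodle purchases and evaluates three response variables: binary purchase incidence, categorical brand choice, and count purchase quantity. For each variable it analyzes mean-level, pattern, and distributional alignment.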
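The three levels of comparison can be sketched for the count variable (purchase quantity). This is a minimal illustration, not the paper's implementation: the summary does not name the metrics used, so mean absolute gap, Pearson correlation of condition means, and total variation distance are assumptions, and the condition names and response values are invented toy data.

```python
from collections import Counter
from statistics import mean

# Toy purchase quantities per experimental condition (invented for illustration).
human = {
    "low_price":  [0, 0, 1, 2, 4, 5],
    "mid_price":  [0, 0, 0, 1, 3, 5],
    "high_price": [0, 0, 0, 1, 1, 4],
}
llm = {
    "low_price":  [2, 2, 2, 2, 2, 2],
    "mid_price":  [1, 1, 1, 2, 2, 2],
    "high_price": [1, 1, 1, 1, 1, 1],
}
SUPPORT = range(0, 6)  # possible quantities 0..5 (assumed)


def empirical_pmf(samples, support):
    """Relative frequency of each value in `support`."""
    counts = Counter(samples)
    return [counts[k] / len(samples) for k in support]


def total_variation(p, q):
    """Total variation distance between two pmfs on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


conditions = list(human)
human_means = [mean(human[c]) for c in conditions]
llm_means = [mean(llm[c]) for c in conditions]

# Level 1: mean-level alignment (average absolute gap in condition means).
mean_gap = mean(abs(h - m) for h, m in zip(human_means, llm_means))

# Level 2: pattern alignment (do condition means move together?).
pattern_corr = pearson(human_means, llm_means)

# Level 3: distributional alignment (per-condition distance between pmfs).
tv_by_condition = {
    c: total_variation(empirical_pmf(human[c], SUPPORT),
                       empirical_pmf(llm[c], SUPPORT))
    for c in conditions
}

print(f"mean gap: {mean_gap:.3f}, pattern corr: {pattern_corr:.3f}")
print({c: round(d, 3) for c, d in tv_by_condition.items()})
```

In this toy case the simulated responses match the human means and the ordering across conditions exactly, yet their distributions are far from the human ones because they lack the zeros and the long right tail. That is the kind of gap mean-level evaluation cannot see.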

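Key innovation: The main contribution is evaluation at the distributional level, which exposes the limits of mean-based evaluation and shows where LLMs fail to capture the variability of human behavior.

Key design: The experiments vary the input configuration, including structured personas and multimodal inputs. These improve alignment, whereas explicit reasoning prompts degrade it. LLM responses are also compared against reference baselines derived from the human data alone.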

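📊 Experimental Highlights

LLMs reproduce condition-level patterns reasonably well but do not beat a simple baseline on the distributional structure of purchase quantity. Specifically, no model outperforms a condition-insensitive baseline that only matches the pooled human distribution, which shows how misleading mean-based evaluation can be. Input configuration also matters: structured personas and multimodal inputs improve alignment.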
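The condition-insensitive baseline can be illustrated with a short sketch. It assumes the baseline predicts, for every condition, the pooled human distribution over all conditions, and it uses the 1-Wasserstein distance as the distributional metric. The distance choice, the condition names, and the toy responses are assumptions made for illustration, not values or methods taken from the paper.

```python
from collections import Counter
from itertools import accumulate
from statistics import mean

# Toy purchase quantities (invented for illustration).
human = {
    "promo":    [0, 0, 1, 2, 4, 5],
    "no_promo": [0, 0, 0, 1, 1, 4],
}
model = {
    "promo":    [2, 2, 2, 2, 2, 2],  # matches the human mean of 2
    "no_promo": [1, 1, 1, 1, 1, 1],  # matches the human mean of 1
}
SUPPORT = range(0, 6)  # possible quantities 0..5 (assumed)


def pmf(samples, support):
    """Empirical probability mass function over `support`."""
    counts = Counter(samples)
    return [counts[k] / len(samples) for k in support]


def wasserstein1(p, q):
    """W1 distance between pmfs on a shared unit-spaced integer support,
    computed as the summed absolute difference of their CDFs."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Baseline: pool all human responses and ignore the condition entirely.
pooled = [x for responses in human.values() for x in responses]
baseline_pmf = pmf(pooled, SUPPORT)

results = {}
for cond, responses in human.items():
    human_pmf = pmf(responses, SUPPORT)
    results[cond] = {
        "mean_gap": abs(mean(responses) - mean(model[cond])),
        "model_w1": wasserstein1(human_pmf, pmf(model[cond], SUPPORT)),
        "baseline_w1": wasserstein1(human_pmf, baseline_pmf),
    }

for cond, r in results.items():
    print(cond, {k: round(v, 3) for k, v in r.items()})
```

Here the toy model has zero mean gap in both conditions but is farther from the human distribution than the pooled baseline, mirroring the pattern reported for purchase quantity.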

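🎯 Application Scenarios

Potential applications include market research, consumer behavior analysis, and human-computer interaction. Better LLM simulation of human survey responses could give companies more accurate market insight and help them refine product design and marketing strategy. The approach may also carry over to a broader range of AI applications and improve human-AI collaboration.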

📄 Abstract (Original)

LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses at mean-level, pattern, and distributional alignment, and against reference baselines from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.
