What Do People Actually Want From AI? Mapping Preference Plurality
Authors: Julia Sepúlveda Coelho, Scott A. Hale
Categories: cs.CL, cs.CY
Published: 2026-06-04
Comments: Accepted at the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
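💡 One-Sentence Takeaway
Maps the plurality of what people want from AI in order to expose the limits of current human-AI alignment methods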
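🎯 Matched Areas: Pillar 2: RL Algorithms and Architecture (RL & Architecture); Pillar 9: Embodied Large Models (Embodied Foundation Models)
Keywords: large language models, human feedback, preference alignment, plurality analysis, open-ended responses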
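📋 Key Points
- Existing alignment methods aggregate conflicting preferences and often rely on unrepresentative samples, so they fail to capture what users actually want.
- By analysing open-ended responses, the paper shows that people's expectations of AI are plural and highlights the limitations of current alignment methods.
- 49% of respondents ask for truthfulness, but they define it in different ways, which suggests that current methods struggle to capture actual preferences.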
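📝 Abstract (Translated Summary)
Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. This approach has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, the authors examine what people actually want from AI systems and reveal concrete failures of current methods. Different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Moreover, when respondents describe what they mean by "truthfulness", their understandings diverge markedly and rest on distinct, potentially incompatible, epistemological bases. These findings expose fundamental problems in current alignment practices.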
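🔬 Method Details
Problem definition: The paper addresses the failure of current AI alignment methods to capture users' actual preferences. Existing methods aggregate conflicting preferences and often draw on unrepresentative samples.
Core idea: By analysing open-ended responses from respondents in many countries, the paper shows how varied people's expectations of AI systems are, and how the same word can hide divergent meanings that current alignment methods overlook.
Technical framework: The study analyses 1,500 open-ended responses from the PRISM dataset, identifying the different values respondents request and the different ways they define them, in order to map the plurality of preferences.
Key contribution: The paper shows that respondents attach multiple meanings to concepts such as "truthfulness". This challenges alignment pipelines built on binary comparisons and underlines how much a single reward model would have to flatten.
Key design: The analysis works on open-ended responses rather than pairwise choices, and it pays attention to the contextual distinctions respondents draw, in particular between what AI should do "by default" and what it should do "if requested". Binary comparisons cannot capture this kind of distinction.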
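As a rough illustration of how a prevalence figure such as "requested by fewer than a quarter of respondents" can be computed once open-ended responses have been coded, here is a minimal sketch. The value labels, the toy respondents, and the function name `value_prevalence` are hypothetical; they are not taken from the paper or from the PRISM dataset, and the paper's own coding procedure may differ.

```python
from collections import Counter


def value_prevalence(coded_responses):
    """Return the share of respondents who mention each value at least once.

    coded_responses: a list with one collection of value labels per respondent.
    """
    n = len(coded_responses)
    if n == 0:
        return {}
    counts = Counter()
    for labels in coded_responses:
        # set() ensures a respondent counts at most once per value
        counts.update(set(labels))
    return {value: count / n for value, count in counts.items()}


# Hypothetical toy data: four respondents with invented labels.
toy = [
    {"truthfulness", "politeness"},
    {"truthfulness"},
    {"human-likeness"},
    {"guardrails", "politeness"},
]
shares = value_prevalence(toy)
```

Counting respondents rather than mentions keeps one very verbose answer from inflating a value's share.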
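📊 Highlights
49% of respondents ask AI systems to be truthful, yet they define truthfulness in different ways, which points to the limits of current alignment methods. In addition, some capabilities, such as how human-like a model behaves, and some features, such as AI guardrails, are openly contested: some respondents want them and others reject them. This suggests that alignment needs a finer-grained design and a closer understanding of users.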
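🎯 Application Scenarios
Potential applications include the design and optimisation of AI systems, especially in user experience and human-computer interaction. With a better understanding of users' varied expectations, developers can build systems that fit user needs more closely, which may improve satisfaction and trust. In the longer term, the work may push alignment methods to become more flexible and more responsive to individual users.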
📄 Abstract (Original)
Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.
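To make the aggregation concern concrete, the sketch below assumes a Bradley-Terry reward model, a common choice in RLHF that the abstract does not itself name. For a single pair of responses A and B, the maximum-likelihood fit satisfies sigmoid(r_A - r_B) = pooled win rate of A, so the fitted reward gap is the logit of that rate. The group sizes and preference rates are hypothetical, and `pooled_reward_gap` is an illustrative helper, not part of the paper.

```python
import math


def pooled_reward_gap(group_sizes, group_pref_for_a):
    """Reward gap r_A - r_B fitted by one Bradley-Terry model on pooled comparisons.

    group_sizes: number of comparisons contributed by each group.
    group_pref_for_a: probability that each group prefers response A over B.
    Assumes the pooled win rate is strictly between 0 and 1.
    """
    total = sum(group_sizes)
    win_rate = sum(n * p for n, p in zip(group_sizes, group_pref_for_a)) / total
    return math.log(win_rate / (1 - win_rate))


# Hypothetical: two equally sized groups with strong, opposite preferences.
gap_balanced = pooled_reward_gap([100, 100], [0.9, 0.1])

# Hypothetical: the same two groups in a 60/40 split.
gap_majority = pooled_reward_gap([120, 80], [0.9, 0.1])
```

In the balanced case the fitted gap is approximately zero, so the pooled model reports indifference although each group holds a strong preference. In the 60/40 case the gap is positive but much smaller than either group's own gap, and it sides with the majority.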