Correcting Selection Bias in Sparse User Feedback for Large Language Model Quality Estimation: A Multi-Agent Hierarchical Bayesian Approach

Authors: Andrea Morandi, Mahesh Viswanathan

Category: cs.CL

Published: 2026-05-12

💡 One-Sentence Summary

Proposes a multi-agent hierarchical Bayesian method to correct selection bias in sparse user feedback.

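🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: selection bias, hierarchical Bayes, user feedback, topic clustering, quality estimation, large language models, bias correction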

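📋 Key Points

User feedback comes from a non-random subset of users, so a naive average over it can land 40-50 percentage points away from true system quality.
The paper proposes a three-agent hierarchical Bayesian pipeline that combines topic clustering with bias modeling to correct selection bias without ground-truth labels on individual interactions.
With a mild prior on the feedback channel, the estimate stays within 4-13 percentage points of true quality as the bias ratio sweeps from 1:1 to 30:1.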

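📝 Abstract (Translated Summary)

In production deployments of large language models, feedback comes from a non-random subset of users, which significantly biases estimates of system quality. The paper treats this as a topic- and sentiment-stratified selection-bias problem and proposes a three-agent hierarchical Bayesian pipeline that requires no ground-truth labels on individual interactions. A Topic Clustering Agent partitions the interaction stream with UMAP and HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial model to infer per-topic selection rates and quality; a Synthesis Agent reweights per-topic quality by true topic prevalence and reports a bias-corrected aggregate posterior. Validation uses the UltraFeedback dataset with simulated selection biases; with a mild prior on the feedback channel, the estimate stays within 4-13 percentage points of true quality.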

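🔬 Method Details

Problem definition: The paper addresses selection bias in user feedback on large language models. Because feedback is left by a non-random subset of users, estimates that average over it deviate substantially from true quality.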
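To see how large the deviation can get, here is a minimal sketch of the expected naive estimate under a simple feedback channel. It assumes, for illustration only, that satisfied users leave a thumbs-up with probability `pos_rate` and dissatisfied users leave a thumbs-down `neg_to_pos_ratio` times as often; the function name, the parameter names, and the value `pos_rate=0.02` are hypothetical and not taken from the paper. Only the true quality 0.6249 comes from the abstract.

```python
def naive_positive_share(quality, pos_rate, neg_to_pos_ratio):
    """Expected share of thumbs-up among all received feedback.

    quality: true fraction of satisfied interactions.
    pos_rate: assumed probability that a satisfied user leaves a thumbs-up.
    neg_to_pos_ratio: assumed factor by which dissatisfied users respond more often.
    """
    neg_rate = pos_rate * neg_to_pos_ratio
    if not 0.0 <= neg_rate <= 1.0:
        raise ValueError("implied negative-feedback rate must lie in [0, 1]")
    ups = quality * pos_rate            # expected thumbs-up per interaction
    downs = (1.0 - quality) * neg_rate  # expected thumbs-down per interaction
    return ups / (ups + downs)


TRUE_QUALITY = 0.6249  # Q* reported for UltraFeedback in the abstract

for ratio in (1, 3, 10, 30):
    estimate = naive_positive_share(TRUE_QUALITY, pos_rate=0.02, neg_to_pos_ratio=ratio)
    gap_pp = 100.0 * (estimate - TRUE_QUALITY)
    print(f"ratio {ratio:>2}:1  naive={estimate:.3f}  gap={gap_pp:+.1f} pp")
```

Under this toy channel the naive average is unbiased only at a 1:1 ratio; the gap grows with the ratio and does not depend on `pos_rate`, which cancels out of the share.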

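Core idea: A three-agent hierarchical Bayesian pipeline handles topic clustering, bias modeling, and synthesis in turn, inferring system quality more accurately without ground-truth labels.

Technical framework: The architecture has three modules. The Topic Clustering Agent clusters text embeddings with UMAP and HDBSCAN; the Bias Modeling Agent fits a hierarchical Beta-Binomial model with the NUTS sampler; the Synthesis Agent weights per-topic quality by true topic prevalence and outputs the bias-corrected posterior.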
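The synthesis step can be sketched in a few lines. The example below assumes the bias-modeling stage has already produced posterior draws of per-topic quality; it then applies the prevalence weighting $\hat\pi_c = n_c/N$ from the abstract to each draw and reads off a crude nearest-rank credible interval. The function name, the input layout, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
def aggregate_quality(cluster_sizes, quality_draws, level=0.95):
    """Reweight per-topic quality draws by topic prevalence n_c / N.

    cluster_sizes: n_c, the number of interactions in each topic cluster.
    quality_draws: one sequence per posterior draw, holding q_c for every cluster.
    Returns (mean of Q-bar, lower bound, upper bound) over the draws.
    """
    if not quality_draws:
        raise ValueError("need at least one posterior draw")
    total = sum(cluster_sizes)
    prevalence = [n / total for n in cluster_sizes]  # pi-hat_c = n_c / N

    q_bar = []
    for draw in quality_draws:
        if len(draw) != len(prevalence):
            raise ValueError("each draw needs one quality value per cluster")
        q_bar.append(sum(p * q for p, q in zip(prevalence, draw)))

    # Nearest-rank interval on the sorted aggregate draws (a rough approximation).
    q_bar.sort()
    tail = (1.0 - level) / 2.0
    last = len(q_bar) - 1
    lower = q_bar[round(tail * last)]
    upper = q_bar[round((1.0 - tail) * last)]
    return sum(q_bar) / len(q_bar), lower, upper


# Toy inputs: three clusters and three posterior draws (hypothetical values).
sizes = [600, 300, 100]
draws = [[0.80, 0.50, 0.20], [0.78, 0.55, 0.25], [0.82, 0.45, 0.15]]
mean, lower, upper = aggregate_quality(sizes, draws)
print(f"Q-bar mean={mean:.3f}, interval=({lower:.3f}, {upper:.3f})")
```

The weights come from all interactions in a cluster, not only those with feedback, which is why the reweighting needs no labels.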

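Key innovation: Combining topic clustering with a hierarchical Bayesian model corrects selection bias without relying on ground-truth labels. According to the abstract, what makes this work is a prior on the feedback channel rather than on latent quality.

Key design: Bias modeling uses a two-stage hierarchical Beta-Binomial model with partial pooling across topics, combined with a mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio), which keeps the estimate stable and accurate across selection-bias levels.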
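This summary does not spell out the likelihood, so the following is only one plausible parametrization consistent with the abstract's symbols $q_c$ and $n_c$. It assumes a first stage in which an interaction in topic $c$ is satisfactory with probability $q_c$, and a second stage in which satisfied and dissatisfied users leave feedback at rates $s_c^{+}$ and $s_c^{-}$. The counts $u_c$ and $d_c$ (thumbs-up and thumbs-down), the split of $s_c$ into $s_c^{+}$ and $s_c^{-}$, and the hyperparameters $\mu_q, \nu_q, \mu_s, \nu_s, \kappa_c$ are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
(u_c,\, d_c,\, n_c - u_c - d_c) &\sim \mathrm{Multinomial}\bigl(n_c;\; q_c s_c^{+},\; (1-q_c)\, s_c^{-},\; 1 - q_c s_c^{+} - (1-q_c)\, s_c^{-}\bigr),\\
q_c &\sim \mathrm{Beta}\bigl(\mu_q \nu_q,\; (1-\mu_q)\,\nu_q\bigr),\\
s_c^{+} &\sim \mathrm{Beta}\bigl(\mu_s \nu_s,\; (1-\mu_s)\,\nu_s\bigr),\\
s_c^{-} &= \kappa_c\, s_c^{+}, \qquad 1 \le \kappa_c \le \kappa_{\max}.
\end{aligned}
```

In this reading, sharing $\mu_q, \nu_q$ across topics gives the partial pooling, while informative choices for $\mu_s$ and the range of $\kappa_c$ play the role of the channel-side prior.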

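📊 Experimental Highlights

With a mild prior on the feedback channel, the Hierarchical-Informed variant stays within 4-13 percentage points of true quality as the bias ratio sweeps from 1:1 to 30:1, whereas every weak-prior variant misses by 22-33 percentage points. The 95% credible intervals cover true quality in 50 of 50 random-seed replicates at $\kappa_{\max}=10$. The comparison covers five Bayesian variants plus the Naive and IPW baselines.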

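🎯 Application Scenarios

Potential applications include quality estimation for large language models, user-satisfaction analysis, and online recalibration of feedback-driven metrics. Correcting selection bias gives a more reliable picture of system quality, which can guide improvements to user experience and system performance.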

📄 Abstract (Original)

[Abridged] Production LLM deployments receive feedback from a non-random fraction of users: thumbs sit mostly in the tails of the satisfaction distribution, and a naive average over them can land 40-50 percentage points away from true system quality. We treat this as a topic- and sentiment-stratified selection-bias problem and propose a three-agent hierarchical Bayesian pipeline that does not require ground-truth labels on individual interactions. A Topic Clustering Agent partitions the stream via UMAP + HDBSCAN over text embeddings; a Bias Modeling Agent fits a two-stage hierarchical Beta-Binomial under NUTS, inferring per-topic selection rates $s_c$ and quality $q_c$ with partial pooling; a Synthesis Agent reweights $q_c$ by true topic prevalence $\hat\pi_c = n_c/N$ to report a bias-corrected aggregate posterior $\bar Q = \sum_c \hat\pi_c q_c$ with credible interval, plus drift signals for online recalibration. Validation uses UltraFeedback (N=10,232 retained interactions, $C=18$ clusters, $Q^\star=0.6249$) with simulated topic- and sentiment-dependent selection biases. We compare five Bayesian variants against Naive and IPW baselines. A mild prior on the feedback channel (typical positive-feedback rate and negative-to-positive ratio, both readable from any production dashboard without labels) keeps Hierarchical-Informed within 4-13 pp of $Q^\star$ as the bias ratio sweeps from 1:1 to 30:1, with 95% credible intervals covering $Q^\star$ in 50/50 random-seed replicates at $\kappa_{\max}=10$. Without channel-side priors, every weak-prior variant misses $Q^\star$ by 22-33 pp: the per-cluster sufficient statistics admit a one-parameter family of equally good fits, and the prior on the bias channel (not on latent quality) is what breaks the degeneracy.
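The degeneracy in the last sentence can be illustrated with a small sketch. It assumes a single cluster whose only observed statistics are the interaction count and the thumbs-up and thumbs-down counts, and a channel in which satisfied users respond at rate `pos_rate` and dissatisfied users at rate `neg_rate`. This parametrization, the function names, and the counts are hypothetical; a hard constraint on the ratio stands in for the paper's prior.

```python
import math


def log_likelihood(n, ups, downs, quality, pos_rate, neg_rate):
    """Multinomial log-likelihood of one cluster's counts, up to a constant."""
    p_up = quality * pos_rate
    p_down = (1.0 - quality) * neg_rate
    p_silent = 1.0 - p_up - p_down
    return (ups * math.log(p_up)
            + downs * math.log(p_down)
            + (n - ups - downs) * math.log(p_silent))


def matching_rates(n, ups, downs, quality):
    """Channel rates that reproduce the observed feedback rates for a chosen quality."""
    return (ups / n) / quality, (downs / n) / (1.0 - quality)


def quality_given_ratio(n, ups, downs, neg_to_pos_ratio):
    """Quality implied once the ratio neg_rate / pos_rate is pinned down."""
    r_up, r_down = ups / n, downs / n
    return neg_to_pos_ratio * r_up / (r_down + neg_to_pos_ratio * r_up)


n, ups, downs = 1000, 12, 38  # hypothetical counts for a single cluster

# Very different quality values fit the same counts equally well.
for quality in (0.2, 0.5, 0.8):
    pos_rate, neg_rate = matching_rates(n, ups, downs, quality)
    ll = log_likelihood(n, ups, downs, quality, pos_rate, neg_rate)
    print(f"q={quality:.1f}  s+={pos_rate:.4f}  s-={neg_rate:.4f}  loglik={ll:.6f}")

# Fixing the channel ratio selects a single point on that family.
print(f"q at ratio 10:1 = {quality_given_ratio(n, ups, downs, 10):.4f}")
```

Two observed rates cannot identify three unknowns, so the likelihood is flat along the whole family; information about the channel, not about quality itself, is what selects a point on it.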
