Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Authors: XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang

Category: cs.CL

Published: 2026-06-03

💡 One-Sentence Summary

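Proposes Self-Evaluation Elicitation (SEE), a method that surfaces a base LLM's latent ability to predict how a judge will score its own output.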

🎯 Matched areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture); Pillar 9: Embodied Foundation Models

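Keywords: self-evaluation, large language models, reinforcement learning, distillation training, model calibration, output quality, judge preference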

📋 Key Points

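Existing approaches to assessing output quality typically rely on an external judge and make little use of the model's own ability to evaluate itself.
This paper proposes Self-Evaluation Elicitation (SEE), which surfaces the model's self-evaluation ability through a calibration-coupled reinforcement learning phase followed by a masked distillation phase.
Using 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration while preserving answer quality.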

📝 Abstract (Summary)

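As large language models (LLMs) are increasingly evaluated by other models, this paper asks whether a model can predict how a judge will score its own output. The study finds that this ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores well above chance. The authors propose Self-Evaluation Elicitation (SEE), which surfaces this ability through a short cycle consisting of a calibration-coupled reinforcement learning phase and a masked distillation phase. Experiments show that SEE improves held-out calibration on three benchmarks while preserving answer quality, and that the elicited self-evaluation remains stable across judges the model was never trained against, indicating a transferable notion of quality rather than a single judge's preference.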

🔬 Method Details

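Problem definition: The paper asks whether a large language model can predict how a judge will score its own output. Existing approaches rely on external judges and do not elicit the model's own self-evaluation.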

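Core idea: Self-Evaluation Elicitation (SEE) surfaces, and then sharpens, the model's latent self-evaluation ability through a short cycle of calibration-coupled reinforcement learning and masked distillation.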

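Technical framework: SEE has two phases. First, a calibration-coupled reinforcement learning phase in which the model improves its answer and predicts the judge's score. Second, a masked distillation phase that sharpens the prediction while leaving the answer untouched.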
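One way to read "sharpens the prediction while leaving the answer untouched" is a distillation loss that is masked so only the score-prediction tokens contribute. The sketch below illustrates that reading and is not the authors' implementation: the function name, the token-level cross-entropy form, and the boolean mask are all assumptions.

```python
import math
from typing import Sequence


def masked_distillation_loss(
    student_logprobs: Sequence[Sequence[float]],
    teacher_probs: Sequence[Sequence[float]],
    is_prediction_token: Sequence[bool],
) -> float:
    """Teacher-to-student cross-entropy, averaged over prediction tokens only.

    Hypothetical sketch: each position holds a distribution over a toy
    vocabulary. Positions where the mask is False (assumed to be answer
    tokens) contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for s_lp, t_p, keep in zip(student_logprobs, teacher_probs, is_prediction_token):
        if not keep:
            continue  # answer token: excluded from the loss
        total += -sum(p * lp for p, lp in zip(t_p, s_lp))
        count += 1
    return total / count if count else 0.0


# Toy usage: position 0 is an answer token, position 1 is a score token.
student = [[math.log(0.9), math.log(0.1)], [math.log(0.5), math.log(0.5)]]
teacher = [[0.2, 0.8], [1.0, 0.0]]
mask = [False, True]
loss = masked_distillation_loss(student, teacher, mask)
```

Because masked positions are skipped, changing the student's distribution on answer tokens leaves the loss, and therefore the gradient signal in this sketch, unchanged.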

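Key innovation: SEE treats judge-aligned self-evaluation as a problem of elicitation rather than acquisition: instead of relying solely on an external judge, it draws out an ability the base model already has.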

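Key design: The reinforcement learning phase is calibration-coupled: its objective accounts for both answer quality and the accuracy of the model's prediction of the judge's score. The masked distillation phase then sharpens the prediction while leaving the answer untouched.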
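To make the idea of coupling answer quality with prediction accuracy concrete, here is a minimal sketch of a scalar reward that rises with the judge's scores and falls with the gap between predicted and actual scores. The abstract does not specify the objective, so the functional form, the 0-10 score scale, the `calibration_weight` parameter, and the attribute names are all hypothetical.

```python
from typing import Dict


def calibration_coupled_reward(
    judge_scores: Dict[str, float],
    predicted_scores: Dict[str, float],
    max_score: float = 10.0,
    calibration_weight: float = 1.0,
) -> float:
    """Quality term minus a weighted calibration-error term (illustrative only).

    quality: mean judge score across attributes, scaled to [0, 1].
    error:   mean absolute gap between predicted and judge scores, same scale.
    """
    attributes = sorted(judge_scores)
    if not attributes:
        raise ValueError("judge_scores must contain at least one attribute")
    n = len(attributes)
    quality = sum(judge_scores[a] for a in attributes) / (n * max_score)
    error = sum(abs(predicted_scores[a] - judge_scores[a]) for a in attributes) / (
        n * max_score
    )
    return quality - calibration_weight * error


# Toy usage with made-up attribute names and scores.
judge = {"helpfulness": 8.0, "clarity": 6.0}
perfect = calibration_coupled_reward(judge, {"helpfulness": 8.0, "clarity": 6.0})
overconfident = calibration_coupled_reward(judge, {"helpfulness": 10.0, "clarity": 6.0})
```

With the same answer, the overconfident prediction earns a lower reward than the exact one, which is the coupling this sketch is meant to show.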

📊 Experimental Highlights

SEE improves held-out calibration on three benchmarks while preserving answer quality, using 160 unique examples, roughly 31x fewer than a reinforcement learning baseline.
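As a back-of-envelope reading of the "31x" figure (an inference from the two stated numbers, not a count given in the abstract), the baseline's data budget would be on the order of:

```latex
N_{\text{baseline}} \approx 31 \times N_{\text{SEE}} = 31 \times 160 = 4960 \text{ examples}
```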

🎯 Application Scenarios

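Potential applications include automated content generation, question-answering systems, and other tasks that benefit from a model estimating the quality of its own output. Because SEE needs only a small number of examples, such self-evaluation could be added at low data cost.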

📄 Abstract (Original)

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
