The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model
Author: Wendy K. Tam
Category: cs.CL
Published: 2026-06-08
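💡 One-Sentence Takeaway
Shows that RLHF acts as a neutral mask: it produces politically neutral output from a language model while leaving the underlying partisan structure intact.
🎯 Matched Areas: Pillar 2: RL Algorithms & Architecture (RL & Architecture) Pillar 9: Embodied Foundation Models
Keywords: reinforcement learning, human feedback, language models, partisan structure, political neutrality, mechanistic study, feature analysis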
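📋 Key Points
- When aligning large language models, existing RLHF methods achieve only functional compliance and do not remove the partisan structure inside the model.
- The paper analyzes the internal representations of Llama 3.1 8B to show how RLHF produces neutral output without removing that structure.
- The experiments indicate that RLHF severs the causal pathway from partisan geometry to output, so the model's generated text appears politically neutral.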
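📝 Abstract (Translated Summary)
Alignment training aims to make large language models safe and useful. Its primary mechanism, reinforcement learning from human feedback (RLHF), shapes model behavior by aligning it with "human values." The process is opaque, however, and a growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. By comparing the internal representations of Llama 3.1 8B before and after RLHF, the paper shows that RLHF does not remove the structured partisan direction in the model; instead, it compresses the variance of the partisan signal to generate consistently neutral output. The key experiments indicate that RLHF encodes political neutrality by severing the causal pathway from partisan geometry to output generation, while the underlying partisan structure remains.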
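🔬 Method Details
Problem definition: The paper examines what RLHF actually does to partisan structure in a large language model. Existing alignment methods do not remove that structure at a deep level, so the model's output can still be influenced by it.
Core idea: Through a mechanistic case study of Llama 3.1 8B, the paper shows that RLHF achieves surface-level neutrality by compressing the variance of the partisan signal, rather than by removing partisanship at its root.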
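The variance-compression idea above can be illustrated with a toy sketch. It is not the paper's procedure: the summary does not say how the partisan direction was estimated, so the difference-of-means estimator, the synthetic activations, and the helper names `partisan_direction` and `projection_variance` are all assumptions made for illustration. The sketch shows a case where the direction is nearly identical in two sets of activations while the variance along it is much smaller in the second set.

```python
import numpy as np

def partisan_direction(left_acts, right_acts):
    """Unit vector pointing from the mean 'left' activation to the mean 'right' one.

    Difference-of-means is an assumed estimator, not necessarily the paper's.
    """
    diff = right_acts.mean(axis=0) - left_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def projection_variance(acts, direction):
    """Variance of the scalar projections of each activation onto `direction`."""
    return float(np.var(acts @ direction))

rng = np.random.default_rng(0)
d = 16                       # toy hidden size (hypothetical)
true_dir = np.zeros(d)
true_dir[0] = 1.0            # synthetic "partisan" axis

def synth(scale, n=200):
    """Synthetic activations: two groups separated by `scale` along true_dir."""
    noise = rng.normal(size=(n, d)) * 0.1
    left = noise[: n // 2] - scale * true_dir
    right = noise[n // 2:] + scale * true_dir
    return left, right

# Stand-ins for base-model and Instruct-model activations (synthetic data only).
base_left, base_right = synth(scale=2.0)
inst_left, inst_right = synth(scale=0.2)

direction = partisan_direction(base_left, base_right)
base_var = projection_variance(np.vstack([base_left, base_right]), direction)
inst_var = projection_variance(np.vstack([inst_left, inst_right]), direction)

# Cosine similarity between the directions estimated in each synthetic "model".
cos = float(partisan_direction(inst_left, inst_right) @ direction)

print(f"variance along direction: base={base_var:.3f}, instruct={inst_var:.3f}")
print(f"cosine between estimated directions: {cos:.3f}")
```

In this toy setup the direction survives (high cosine similarity) even though the spread along it shrinks, which is the pattern the summary describes; with real models the activations would come from a chosen layer's hidden states rather than from a random generator.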
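Technical framework: The study first compares the model's internal representations before and after RLHF, then applies sparse autoencoder decomposition, and finally uses feature-level steering experiments to confirm the causal disconnect.
Key contribution: The study shows that RLHF does not remove partisan structure but achieves functional neutrality by severing a causal pathway, in contrast to the view that alignment training erases such structure.
Key design: Sparse autoencoder decomposition is used to analyze the model's features. It shows that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model, and feature-level steering experiments confirm the causal disconnect.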
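The two measurements named above, how often a sparse autoencoder feature fires and what feature-level steering does to an activation, can be sketched as follows. This is a toy under stated assumptions: the ReLU encoder form, the identity-like weights, the threshold bias, and the helper names `sae_encode`, `activation_rate`, and `steer` are all hypothetical, and the data is synthetic. It makes no claim about the paper's actual autoencoder or results.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Toy sparse-autoencoder encoder: ReLU(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def activation_rate(feats, idx):
    """Fraction of inputs on which feature `idx` is active (> 0)."""
    return float((feats[:, idx] > 0).mean())

def steer(x, W_dec, idx, alpha):
    """Feature-level steering: add alpha times the feature's decoder direction."""
    return x + alpha * W_dec[idx]

rng = np.random.default_rng(0)
n, d, n_feat = 500, 8, 4            # hypothetical toy sizes

W_enc = np.eye(d)[:, :n_feat]       # feature i reads input dimension i
W_dec = np.eye(n_feat, d)           # feature i writes to dimension i
b_enc = -np.ones(n_feat)            # a feature fires only when its input exceeds 1

# Synthetic stand-ins: "instruct" equals "base" with dimension 0 compressed.
base = rng.normal(size=(n, d)) * 1.5
instruct = base.copy()
instruct[:, 0] *= 0.05

rate_base = activation_rate(sae_encode(base, W_enc, b_enc), 0)
rate_instruct = activation_rate(sae_encode(instruct, W_enc, b_enc), 0)

# Steering shifts every activation by alpha along feature 0's decoder direction.
steered = steer(instruct, W_dec, idx=0, alpha=2.0)
shift = steered - instruct

print(f"feature 0 active on {rate_base:.1%} of base inputs")
print(f"feature 0 active on {rate_instruct:.1%} of instruct inputs")
```

In a real experiment the steered activation would be written back into the model's forward pass and the generated text compared with and without the intervention; the toy only shows the arithmetic of the intervention itself.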
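📊 Experimental Highlights
The experiments show that RLHF does not remove partisan structure in Llama 3.1 8B; instead, it compresses the variance of the partisan signal so that the generated output is functionally politically neutral. Feature-level experiments confirm the causal disconnect and indicate that the model still retains the latent partisan structure during generation.
🎯 Application Scenarios
Potential application areas include politically related text generation, social media content moderation, and automated customer service. By understanding the limitations of RLHF, researchers can design better alignment mechanisms to keep models neutral and safe on sensitive topics, and the findings may inform future work on AI ethics and policy.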
📄 Abstract (Original)
The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural so that the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.