Exemplar Partitioning for Mechanistic Interpretability

📄 arXiv: 2605.14347v1

Author: Jessica Rumbelow

Categories: cs.LG

Published: 2026-05-14

Comments: Code: https://github.com/jessicarumbelow/exemplar-partitioning. Pretrained dictionaries: https://huggingface.co/datasets/J-RUM/exemplar-partitioning


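💡 One-Sentence Summary

Proposes Exemplar Partitioning (EP), an unsupervised method for building interpretable feature dictionaries in support of mechanistic interpretability.

🎯 Matched Area: Pillar 9: Embodied Foundation Models

Keywords: interpretability, feature dictionaries, unsupervised learning, clustering, large language models, causal intervention, activation space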

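📋 Key Points

  1. Existing methods for building interpretable feature dictionaries typically need large amounts of compute and training tokens, which limits where they can be applied.
  2. Exemplar Partitioning (EP) combines leader clustering with a Voronoi partition of activation space, using about 1000 times fewer tokens than comparable sparse autoencoders (SAEs) while keeping the dictionary interpretable.
  3. On Gemma-2-2B, EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core. On AxBench latent concept detection, EP at p1 reaches a mean AUROC of 0.881, which is 0.126 above the canonical GemmaScope SAE leaderboard entry.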

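📝 Abstract (Summary)

This paper introduces Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with about 1000 times fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and its intervention direction. Dictionary size is not specified in advance; it is determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints.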

🔬 Method Details

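Problem definition: The paper targets the high compute cost and large token budget needed to build interpretable feature dictionaries for large language models. Existing sparse autoencoder (SAE) approaches are expensive on both counts.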

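Core idea: EP is unsupervised. It leader-clusters activations and uses the resulting exemplars to define a Voronoi partition of activation space, which yields an interpretable dictionary at much lower compute cost.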

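Technical framework: The pipeline has three parts: leader clustering of streamed activations, construction of the Voronoi partition, and use of each exemplar as the membership criterion and intervention direction for its region. Each region is anchored by one observed exemplar, and the dictionary size follows from the activation geometry at the chosen threshold rather than being fixed in advance.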
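The build step described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name `build_ep_dictionary`, the parameter name `tau`, the use of Euclidean distance, and the toy 2-D "activations" are all assumptions made for the example.

```python
import math
from typing import Iterable, List, Sequence


def build_ep_dictionary(
    stream: Iterable[Sequence[float]], tau: float
) -> List[List[float]]:
    """Leader-cluster a stream of activation vectors in a single pass.

    A vector whose nearest exemplar lies within `tau` joins that region;
    otherwise the vector itself becomes the exemplar of a new region.
    Euclidean distance is an assumption of this sketch.
    """
    exemplars: List[List[float]] = []
    for x in stream:
        # Distance to the closest exemplar so far (infinite if there is none).
        nearest = min((math.dist(x, e) for e in exemplars), default=math.inf)
        if nearest > tau:
            # Store the observed vector unchanged; nothing is learned or averaged.
            exemplars.append(list(x))
    return exemplars


# Toy stream: two tight groups of points in 2-D.
stream = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9), (0.0, 0.2)]
dictionary = build_ep_dictionary(stream, tau=1.0)
print(len(dictionary), dictionary)
```

Note that the number of regions is an output of the procedure: lowering `tau` on the same stream produces more exemplars, and raising it produces fewer.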

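Key innovation: Exemplars are observed activations rather than learned features, so dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints, which makes them both easier to interpret and more practical to use.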

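Key design: The main design choices are the distance threshold, the clustering procedure, and how exemplars are selected while the dictionary is built. The threshold in particular determines how many regions the dictionary ends up with.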
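Once a dictionary exists, Voronoi membership reduces to a nearest-exemplar lookup. The sketch below shows that lookup, the resulting one-hot code, and the nearest-exemplar distance, which the original abstract notes can act as an out-of-distribution signal. The helper names `assign` and `one_hot`, the Euclidean metric, and the choice to flag a point when its distance exceeds the build threshold `tau` are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def assign(x: Sequence[float], exemplars: List[List[float]]) -> Tuple[int, float]:
    """Return (index of the nearest exemplar, distance to it)."""
    distances = [math.dist(x, e) for e in exemplars]
    idx = min(range(len(exemplars)), key=distances.__getitem__)
    return idx, distances[idx]


def one_hot(idx: int, size: int) -> List[int]:
    """Encode region membership as a vector with exactly one nonzero entry."""
    return [1 if i == idx else 0 for i in range(size)]


# Hypothetical two-region dictionary and threshold.
exemplars = [[0.0, 0.0], [5.0, 5.0]]
tau = 1.0

idx, dist = assign([4.8, 5.1], exemplars)
code = one_hot(idx, len(exemplars))

# A point far from every exemplar gets a large nearest-exemplar distance.
_, far_dist = assign([20.0, 20.0], exemplars)
looks_out_of_distribution = far_dist > tau
print(idx, code, round(dist, 3), looks_out_of_distribution)
```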

📊 Experimental Highlights

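On Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions. On AxBench latent concept detection at Gemma-2-2B-it layer 20, EP at p1 reaches a mean AUROC of 0.881, which is 0.126 above the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE-A's 0.911, at about 1000 times less build compute.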
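The causal interventions mentioned above rely on each exemplar doubling as an intervention direction. The source does not spell out the ablation procedure here, so the sketch below uses projection ablation, a common choice in interpretability work, as a stand-in: it removes the component of an activation that lies along the exemplar. The function name `ablate_direction` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def ablate_direction(x: Sequence[float], exemplar: Sequence[float]) -> List[float]:
    """Remove from `x` its component along `exemplar` (projection ablation).

    This is an assumed form of "exemplar ablation", not a procedure
    confirmed by the source.
    """
    norm = math.sqrt(sum(v * v for v in exemplar))
    if norm == 0.0:
        raise ValueError("exemplar must be a nonzero vector")
    unit = [v / norm for v in exemplar]
    # Scalar projection of x onto the unit exemplar direction.
    coeff = sum(a * u for a, u in zip(x, unit))
    return [a - coeff * u for a, u in zip(x, unit)]


x = [3.0, 4.0]
exemplar = [0.0, 2.0]
ablated = ablate_direction(x, exemplar)

# After ablation the activation has no component left along the exemplar.
residual = sum(a * e for a, e in zip(ablated, exemplar))
print(ablated, residual)
```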

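🎯 Application Scenarios

Potential applications include model interpretability in natural language processing, feature selection, and intervention analysis. With an interpretable feature dictionary, researchers and engineers can better understand and adjust the behavior of large language models, which supports more transparent and controllable AI systems.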

📄 Abstract (Original)

We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^{3}\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim 20\%$ of EP regions match an SAE feature at $F_{1} > 0.5$, and EP one-hot probes retain $\sim 97\%$ of raw-activation probe accuracy at $\ell_{0} = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_{1}$ reaches mean AUROC $0.881$, $+0.126$ over the canonical GemmaScope SAE leaderboard entry and within $0.030$ of SAE-A's $0.911$, at $\sim 10^{3}\times$ less build compute.
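The abstract reports that about 20% of EP regions match an SAE feature at $F_{1} > 0.5$. One plausible way to score such a match, sketched here under stated assumptions, is to compare two binary masks over the same tokens: whether each token falls in the EP region, and whether the SAE feature fires on it. Binarizing the SAE feature (for example, activation greater than zero) and the helper name `f1_overlap` are assumptions of this example.

```python
from typing import Sequence


def f1_overlap(region_mask: Sequence[bool], feature_mask: Sequence[bool]) -> float:
    """F1 score between two binary masks defined over the same tokens."""
    if len(region_mask) != len(feature_mask):
        raise ValueError("masks must cover the same tokens")
    pairs = list(zip(region_mask, feature_mask))
    tp = sum(1 for r, f in pairs if r and f)        # token in both
    fp = sum(1 for r, f in pairs if f and not r)    # feature fires, not in region
    fn = sum(1 for r, f in pairs if r and not f)    # in region, feature silent
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy masks over six tokens.
region_mask = [True, True, True, False, False, False]
feature_mask = [True, True, False, True, False, False]

score = f1_overlap(region_mask, feature_mask)
is_match = score > 0.5
print(round(score, 3), is_match)
```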