Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability
Authors: Seyed Arshan Dalili, Mehrdad Mahdavi
Categories: cs.LG, cs.AI
Published: 2026-06-04
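💡 One-sentence takeaway
Proposes Subspace-Aware Sparse Autoencoders (SASA) to address feature splitting in sparse autoencoders.
🎯 Matched area: Pillar 9: Embodied Foundation Models
Keywords: sparse autoencoders, mechanistic interpretability, multi-dimensional features, block sparsity, natural language processing, deep learning, model optimization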
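📋 Key points
- Existing sparse autoencoders assign each latent feature a single decoder direction, which causes feature splitting and obscures the features' intrinsic geometry.
- SASA learns decoder subspaces and enforces block sparsity, so a multi-dimensional feature can be represented by one group instead of being split across many latents.
- On GPT-2 and Mistral-7B, SASA improves monosemanticity and interpretability and matches or exceeds standard SAEs while training on roughly half the token budget.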
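📝 Summary
Sparse autoencoders (SAEs) are widely used for mechanistic interpretability of large language models, but assigning each latent a single decoder direction mismatches the multi-dimensional structure of model features and induces feature splitting. This paper proposes Subspace-Aware Sparse Autoencoders (SASA), which learn decoder subspaces and enforce block sparsity to reduce feature splitting and improve interpretability. On GPT-2 and Mistral-7B, SASA matches or exceeds standard SAEs while training on roughly half the token budget.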
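🔬 Method details
Problem definition: The paper addresses the limitations of sparse autoencoders on multi-dimensional features. Existing methods implicitly treat every feature as one-dimensional, which causes feature splitting and obscures the features' intrinsic geometry.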
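The splitting pressure described above can be illustrated in the simplest case, a feature that fills a 2-dimensional plane. The sketch below is an illustration under stated assumptions, not the paper's bound: it assumes each input is reconstructed from exactly one single-direction atom (its projection onto the nearest line through the origin), and the function name `atoms_needed_2d` is hypothetical.

```python
import math

def atoms_needed_2d(eps):
    """Smallest number of single-direction atoms (lines through the origin)
    such that every unit vector in the plane has projection residual <= eps
    onto its nearest atom. Assumes one active atom per input."""
    # N equally spaced lines leave a worst-case angle of pi / (2N) to the
    # nearest line, and the residual of a unit vector at angle theta to a
    # line is sin(theta). Solving sin(pi / (2N)) <= eps for N gives:
    return math.ceil(math.pi / (2 * math.asin(eps)))

for eps in (0.5, 0.1, 0.01):
    print(eps, atoms_needed_2d(eps))
```

Under this assumption the atom count grows roughly like 1/ε even in the plane, whereas a single rank-2 decoder block spans the same plane exactly.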
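Core idea: Subspace-Aware Sparse Autoencoders (SASA) introduce learned decoder subspaces and block sparsity, so that a multi-dimensional feature can be represented as a unit rather than split across latents.
Technical framework: SASA has three components: learned decoder subspaces, block sparsity enforced via Top-$s$ group gating, and a nuclear-norm regularizer that adapts each group's effective rank.
Key innovation: SASA replaces single-vector decoders with learned decoder subspaces. Once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is also the global minimizer of the SASA objective.
Key design: A nuclear-norm regularizer adapts the effective rank of each group, and block sparsity activates whole groups of latents together, which reduces feature splitting during training.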
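The components listed above can be sketched as a small NumPy forward pass. This is a minimal sketch under assumptions, not the authors' implementation: the linear encoder, the use of the block L2 norm as the gating score, and all names (`sasa_forward`, `nuclear_norm_penalty`, `W_enc`, `b_enc`, `D`) are hypothetical choices made for illustration.

```python
import numpy as np

def sasa_forward(x, W_enc, b_enc, D, s):
    """Hypothetical SASA-style forward pass for one activation vector.

    x:     (m,)       model activation
    W_enc: (G, r, m)  encoder weights, one r-dimensional block per group
    b_enc: (G, r)     encoder biases
    D:     (G, m, r)  decoder subspaces, one m x r basis per group
    s:     number of groups kept active (Top-s group gating)
    """
    z = np.einsum("grm,m->gr", W_enc, x) + b_enc   # (G, r) block codes
    scores = np.linalg.norm(z, axis=1)             # assumed gating score: block L2 norm
    keep = np.argsort(scores)[-s:]                 # indices of the s highest-scoring groups
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    z_sparse = z * mask[:, None]                   # zero out every other block
    x_hat = np.einsum("gmr,gr->m", D, z_sparse)    # sum of per-group subspace reconstructions
    return x_hat, z_sparse, mask

def nuclear_norm_penalty(D):
    # Sum over groups of the nuclear norm (sum of singular values) of each
    # decoder block; penalizing it pushes each block toward low effective rank.
    return sum(np.linalg.norm(D[g], ord="nuc") for g in range(D.shape[0]))

# Usage with random parameters (shapes only; nothing here is trained).
rng = np.random.default_rng(0)
G, r, m, s = 6, 3, 8, 2
x_hat, z_sparse, mask = sasa_forward(
    rng.normal(size=m), rng.normal(size=(G, r, m)),
    np.zeros((G, r)), rng.normal(size=(G, m, r)), s)
```

A training loss in this sketch would combine the reconstruction error between `x` and `x_hat` with a weighted `nuclear_norm_penalty(D)`; the weighting is not specified in this summary.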
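📊 Experimental highlights
On GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption and improves monosemanticity and interpretability. It matches or exceeds standard SAEs while training on roughly half the token budget.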
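🎯 Application scenarios
The work applies most directly to natural language processing and machine learning. By improving interpretability, SASA can help researchers and developers understand and optimize the internal mechanisms of large language models, supporting more transparent and reliable systems.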
📄 Abstract (original)
Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group's effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.