How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Authors: Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

Categories: cs.LG, cs.AI

Published: 2026-03-26

Comments: 27 pages, 6 figures, 6 tables. Analysis covers Gemma 3 1B, Gemma 2 2B, and Llama 3.2 1B across 22 experimental runs. Code and data available at https://github.com/hborobia/sae-pruning-paper

💡 One-Sentence Takeaway

Uses sparse autoencoders to analyze how weight pruning affects the features learned by language models.

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: weight pruning, language models, sparse autoencoders, feature analysis, model compression

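📋 Key Points

How weight pruning affects the internal representations of language models is not well understood, which holds back further progress in model compression and interpretability.
The study uses sparse autoencoders (SAEs) as probes to systematically analyze how the feature geometry of language models changes across pruning methods and sparsity levels.
Experiments show that rare features survive pruning better than frequent ones, that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, and that geometric feature survival does not predict causal importance.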

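📝 Abstract (translated from Chinese)

Weight pruning is a common technique for compressing large language models, but its effect on learned internal representations remains poorly understood. This paper presents the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using sparse autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0-60%), the authors investigate five research questions covering seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. The most striking finding is that rare SAE features (those with low firing rates) survive pruning far better than frequent ones, with a within-condition Spearman correlation of rho = -1.0 in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. The authors further show that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, that pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance, a dissociation with implications for interpretability under compression.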

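🔬 Method Details

Problem definition: The paper addresses the unclear effect of weight pruning on the internal representations of large language models. Prior work lacks a systematic analysis of how the feature geometry of a model changes after pruning, which makes it hard to understand how pruning affects performance and interpretability.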

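Core idea: The paper uses sparse autoencoders (SAEs) as interpretability probes and compares the features SAEs learn before and after pruning to understand how pruning affects a language model's internal representations. Because an SAE extracts sparse features from model activations, it can reveal which features are preserved and which are destroyed during pruning.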
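To make the probe concrete, here is a minimal sketch of an SAE forward pass in plain Python. The ReLU encoder, linear decoder, and L1 sparsity penalty are a common SAE formulation and are an assumption here; this summary does not state the paper's exact SAE architecture. All names (`sae_forward`, `W_enc`, `W_dec`, `l1_coeff`) are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Return (features, reconstruction, loss) for one activation vector x.

    Assumed shapes: W_enc is (n_features, d_model), W_dec is (d_model, n_features).
    """
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    f = relu(pre)  # sparse feature activations
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(a) for a in f)  # sparsity penalty on the features
    return f, x_hat, mse + l1_coeff * l1
```

A feature "fires" on an input when its entry in `f` is positive; the columns of `W_dec` are the feature directions whose geometry can be compared across dense and pruned models.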

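Technical framework: The overall pipeline has four main steps: 1) select pre-trained language models (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B); 2) prune them with two methods, magnitude and Wanda, at sparsity levels from 0% to 60%; 3) train SAEs on the pruned models; 4) analyze the features the SAEs learn, including their firing rate, survival rate, transferability, and causal importance.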
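The two pruning criteria in step 2 can be sketched as follows. Magnitude pruning scores each weight by its absolute value; Wanda, as published, scores each weight by its absolute value times the L2 norm of the corresponding input activation over a calibration set, and compares weights within each output row. Applying the per-row grouping to magnitude pruning as well is a simplification assumed here, and the function names are hypothetical; the paper's own implementation may differ.

```python
import math


def prune_rows(W, scores, sparsity):
    """Zero the lowest-scoring `sparsity` fraction of weights in each row."""
    pruned = []
    for row, srow in zip(W, scores):
        k = int(len(row) * sparsity)  # number of weights removed per row
        drop = set(sorted(range(len(row)), key=lambda j: srow[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


def magnitude_scores(W):
    # Importance of a weight is just its absolute value.
    return [[abs(w) for w in row] for row in W]


def wanda_scores(W, X):
    """X: list of calibration input vectors, one per token."""
    n_in = len(W[0])
    # L2 norm of each input dimension across the calibration tokens.
    norms = [math.sqrt(sum(x[j] ** 2 for x in X)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in W]


# Toy example (made-up numbers): a small weight on a high-activation input.
W = [[0.1, -2.0, 0.5, 1.0]]
X = [[10.0, 0.1, 1.0, 0.1], [10.0, 0.1, 1.0, 0.1]]
by_magnitude = prune_rows(W, magnitude_scores(W), 0.5)
by_wanda = prune_rows(W, wanda_scores(W, X), 0.5)
```

On this toy input the two criteria keep different weights: magnitude pruning drops the two smallest weights, while the Wanda-style score keeps the small weight attached to the strongly activated input.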

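Key innovation: The central finding is that rare features survive pruning better than frequent ones. This runs against the usual intuition and suggests that pruning does more than remove unimportant weights: it acts as an implicit feature selection mechanism that preferentially destroys high-frequency generic features while preserving specialized rare ones. Survival here is geometric; the study reports that it does not predict causal importance, so surviving rare features should not be assumed to matter more for model behavior.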

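Key design: 1) sparse autoencoders serve as the interpretability tool for extracting and analyzing the model's internal features; 2) magnitude and Wanda pruning are compared directly, showing that Wanda preserves feature structure better; 3) the effect of pruning is assessed along several axes: feature firing rate, survival rate, transferability, and causal importance.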
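Two of the quantities above can be illustrated with a short sketch. The firing rate is taken to be the fraction of tokens on which a feature is active. The survival criterion shown, matching each dense-model feature direction to its nearest pruned-model direction by cosine similarity and counting it as surviving above a threshold, is an assumption made for illustration; this summary does not define the paper's survival metric, and the `0.7` threshold and function names are hypothetical.

```python
import math


def firing_rate(acts):
    """acts: per-token activation values for one feature."""
    return sum(1 for a in acts if a > 0) / len(acts)


def cosine(u, v):
    # Assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def survival_rate(base_dirs, pruned_dirs, threshold=0.7):
    """Fraction of base features with a close match among pruned features."""
    survived = sum(
        1
        for d in base_dirs
        if max(cosine(d, p) for p in pruned_dirs) >= threshold
    )
    return survived / len(base_dirs)
```

Under this assumed definition, survival measures only whether a direction persists geometrically, which is why it can diverge from causal importance.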

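📊 Experimental Highlights

Rare SAE features survive pruning better than frequent ones, with a within-condition Spearman correlation of rho = -1.0 in 11 of 17 experimental conditions. Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning. Pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity. Finally, geometric feature survival does not predict causal importance, so features of pruned models need to be interpreted with care.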
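For readers unfamiliar with the statistic, the sketch below computes a Spearman rank correlation in plain Python. A value of -1.0 means the relationship is perfectly monotone decreasing: every increase in firing rate comes with a decrease in survival. The input numbers are made up for illustration and are not taken from the paper, and the function names are hypothetical.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Illustrative values only: survival falls as firing rate rises.
rates = [0.001, 0.01, 0.1, 0.5]
survival = [0.9, 0.7, 0.4, 0.2]
rho = spearman(rates, survival)  # approximately -1.0
```

Because the statistic depends only on rank order, it reaches -1.0 whenever the ordering is exactly reversed, regardless of how large the differences in survival are.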

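🎯 Application Scenarios

The results apply to model compression and model interpretability analysis. Understanding how pruning affects model features can inform pruning strategies that shrink models further while maintaining performance. The study also helps clarify the internal workings of large language models, which supports the development of more interpretable AI systems.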

📄 Abstract (original)

Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0-60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features, those with low firing rates, survive pruning far better than frequent ones, with within-condition Spearman correlations of rho = -1.0 in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. We further show that Wanda pruning preserves feature structure up to 3.7x better than magnitude pruning, that pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance, a dissociation with implications for interpretability under compression.
