TracrBench: Generating Interpretability Testbeds with Large Language Models

Authors: Hannes Thurnherr, Jérémy Scheurer

Categories: cs.CL, cs.AI, cs.LG

Published: 2024-09-07

Comments: 6 pages + appendix, 4 figures, ICML Mechanistic Interpretability Workshop

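💡 One-sentence takeaway

Introduces TracrBench, which uses LLMs to generate interpretability testbeds and speed up progress in understanding Transformer models.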

🎯 Matched area: Pillar 9: Embodied Foundation Models

Keywords: interpretability, Transformer models, large language models, testbed generation, RASP programs

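📋 Key points

Transformer models are hard to interpret, and without a ground-truth mapping between model weights and their functional roles, interpretability methods are difficult to evaluate.
LLMs are used to generate RASP programs, which are then compiled into Transformer weights to build the interpretability testbed TracrBench.
GPT-4-turbo's ability to generate RASP programs is evaluated; the task proves challenging, so part of the dataset had to be written by hand.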

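📝 Abstract (translated)

Interpretability research on Transformer language models is difficult, in particular because the models have many parameters and there is no explicit mapping between model weights and their functional roles, which hinders evaluation of interpretability methods. Tracr, a method for generating compiled Transformers with inherent ground-truth mappings from RASP programs, was proposed to address this. However, manually creating the large number of models needed to validate interpretability methods is laborious and time-consuming. This paper presents a new approach that uses large language models (LLMs) to generate interpretability testbeds and introduces TracrBench, a dataset of 121 manually written and LLM-generated, human-validated RASP programs together with their corresponding Transformer weights. Along the way, the authors evaluate how well frontier LLMs can generate RASP programs autonomously and find the task very challenging. With a 20-shot prompt and best-of-5 sampling, GPT-4-turbo correctly implements only 57 of 101 test programs, so the remaining programs had to be implemented by hand. With its 121 samples, TracrBench is intended as a testbed for evaluating and comparing interpretability methods.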

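🔬 Method details

Problem definition: Interpretability research on Transformer models faces two main pain points: the parameter count is very large, and the mapping between model weights and actual function is unknown. This makes it hard to evaluate and improve interpretability methods, and it blocks a deeper understanding of the internal mechanisms of Transformers. Building test cases with a ground-truth mapping by hand is expensive and does not scale to what the research needs.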

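Core idea: Use the generative ability of large language models (LLMs) to write RASP programs whose ground truth is known, then compile those programs into Transformer models. This makes it possible to build a larger interpretability testbed quickly and to use it for evaluating and comparing interpretability methods. The key is to exploit the programming ability of LLMs to lower the cost of building the testbed while keeping an explicit ground truth for every model.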
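To make "a RASP program with known ground truth" more concrete, here is a minimal sketch in plain Python. It is a toy emulation of RASP-style primitives, not the Tracr library: the function names `select`, `selector_width`, and `aggregate` only loosely mirror RASP concepts, and `length` and `frac_prev_equal` are illustrative programs chosen for this example, not entries taken from TracrBench.

```python
from typing import Callable, List, Sequence

# selector[q][k] is True when query position q attends to key position k.
Selector = List[List[bool]]


def select(keys: Sequence, queries: Sequence,
           predicate: Callable[[object, object], bool]) -> Selector:
    """Build an attention-like boolean pattern from a key/query predicate."""
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: Selector) -> List[int]:
    """Count, for each query position, how many key positions are selected."""
    return [sum(row) for row in selector]


def aggregate(selector: Selector, values: Sequence[float]) -> List[float]:
    """Average the selected values for each query position (0.0 if none)."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else 0.0)
    return out


def length(tokens: Sequence) -> List[int]:
    """Toy program: every position outputs the sequence length."""
    all_true = select(tokens, tokens, lambda k, q: True)
    return selector_width(all_true)


def frac_prev_equal(tokens: Sequence, target) -> List[float]:
    """Toy program: fraction of tokens up to each position equal to target."""
    indices = list(range(len(tokens)))
    prefix = select(indices, indices, lambda k, q: k <= q)
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    return aggregate(prefix, is_target)
```

Because each step is an explicit select or aggregate operation, the intended computation of the program is known in advance. That is the property a compiled Transformer inherits and that an interpretability method can be checked against.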

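Technical framework: TracrBench is built in four stages: 1) RASP programs are written by hand and generated by an LLM; 2) the LLM-generated RASP programs are checked by humans for correctness and corrected where needed; 3) the RASP programs are compiled into Transformer models, which yields the corresponding model weights; 4) the RASP programs and their Transformer weights are assembled into the TracrBench dataset. The LLM generation stage uses prompt engineering and a sampling strategy to improve the quality of the generated programs.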

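Key innovation: The central contribution is using an LLM to generate an interpretability testbed automatically. Compared with building every model by hand, this lowers the cost and speeds up testbed construction. Human validation and correction of the generated programs then safeguard the quality and reliability of the testbed.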

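Key design: The LLM generation stage uses a 20-shot prompt and best-of-5 sampling. A 20-shot prompt means that 20 example programs are supplied as references when the LLM is asked to write a RASP program. Best-of-5 sampling means that the LLM produces 5 candidate programs, from which the best one is chosen. In addition, a human validation step checks and corrects the LLM-generated programs to make sure they are functionally correct.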
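The few-shot prompt and best-of-n loop described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the summary does not say how the "best" candidate is chosen, so the sketch assumes a candidate is accepted as soon as it passes a set of input/output tests. The LLM call is replaced by a stub, and `build_prompt`, `passes`, `best_of_n`, and `stub_generator` are hypothetical helper names.

```python
from itertools import cycle
from typing import Callable, List, Optional, Sequence, Tuple

Program = Callable[[Sequence[int]], List[int]]
TestCase = Tuple[List[int], List[int]]  # (input sequence, expected output)


def build_prompt(examples: Sequence[str], task: str) -> str:
    """Concatenate the few-shot example programs, then append the new task."""
    shots = "\n\n".join(examples)
    return f"{shots}\n\n# Task: {task}\n"


def passes(program: Program, tests: Sequence[TestCase]) -> bool:
    """Assumed acceptance check: the program must match every test case."""
    for inputs, expected in tests:
        try:
            if program(inputs) != expected:
                return False
        except Exception:
            return False  # a crashing candidate counts as a failure
    return True


def best_of_n(generate: Callable[[str], Program], prompt: str,
              tests: Sequence[TestCase], n: int = 5) -> Optional[Program]:
    """Sample up to n candidates and return the first one that passes."""
    for _ in range(n):
        candidate = generate(prompt)
        if passes(candidate, tests):
            return candidate
    return None  # no candidate passed: fall back to a manual implementation


def stub_generator(candidates: Sequence[Program]) -> Callable[[str], Program]:
    """Stand-in for an LLM call: hands out fixed candidates in order."""
    pool = cycle(candidates)
    return lambda prompt: next(pool)


# Hypothetical task: reverse the input sequence.
tests = [([1, 2, 3], [3, 2, 1]), ([4, 4, 5], [5, 4, 4])]
prompt = build_prompt(["# example program 1", "# example program 2"], "reverse")
generate = stub_generator([
    lambda xs: list(xs),            # wrong: identity
    lambda xs: sorted(xs),          # wrong: sorts instead of reversing
    lambda xs: list(reversed(xs)),  # correct
])
accepted = best_of_n(generate, prompt, tests, n=5)
```

Returning `None` when all samples fail corresponds to the case in which a program has to be implemented by hand.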

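📊 Experimental highlights

With a 20-shot prompt and best-of-5 sampling, GPT-4-turbo correctly implements only 57 of 101 test programs (about 56%), which shows that autonomous generation of complex RASP programs by an LLM is still challenging. Even so, the programs that the LLM did implement correctly did not have to be written by hand, which reduces the cost of building the interpretability testbed. The resulting dataset contains 121 samples for evaluating and comparing interpretability methods.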

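🎯 Application scenarios

TracrBench can be used to evaluate and compare interpretability methods for Transformer models, such as attention visualization, gradient-based analysis, and knowledge extraction. The dataset can help researchers understand the internal mechanisms of Transformer models and develop more effective interpretability methods. TracrBench can also be used to train and evaluate the programming ability of LLMs, supporting their use in software engineering.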

📄 Abstract (original)

Achieving a mechanistic understanding of transformer-based language models is an open challenge, especially due to their large number of parameters. Moreover, the lack of ground truth mappings between model weights and their functional roles hinders the effective evaluation of interpretability methods, impeding overall progress. Tracr, a method for generating compiled transformers with inherent ground truth mappings in RASP, has been proposed to address this issue. However, manually creating a large number of models needed for verifying interpretability methods is labour-intensive and time-consuming. In this work, we present a novel approach for generating interpretability test beds using large language models (LLMs) and introduce TracrBench, a novel dataset consisting of 121 manually written and LLM-generated, human-validated RASP programs and their corresponding transformer weights. During this process, we evaluate the ability of frontier LLMs to autonomously generate RASP programs and find that this task poses significant challenges. GPT-4-turbo, with a 20-shot prompt and best-of-5 sampling, correctly implements only 57 out of 101 test programs, necessitating the manual implementation of the remaining programs. With its 121 samples, TracrBench aims to serve as a valuable testbed for evaluating and comparing interpretability methods.
