CRISP: Persistent Concept Unlearning via Sparse Autoencoders

Welcome to the official repository for CRISP, a parameter-efficient method for persistent concept unlearning in large language models using sparse autoencoders (SAEs).

🚨 The Problem: Large language models (LLMs) can memorize harmful or sensitive knowledge. Existing unlearning methods often degrade general utility or fail to remove the knowledge permanently, so it can resurface under specific prompts or attacks.

✅ Our Solution: CRISP leverages Sparse Autoencoders (SAEs) to automatically identify and suppress specific features activated by harmful knowledge. By fine-tuning with LoRA to suppress these features, CRISP achieves persistent removal while preserving the model's fluency and benign capabilities.

CRISP Main Method

(1) Feature Selection: Identify SAE features active on target data but not benign data. (2) Model Optimization: Fine-tune (LoRA) to suppress these features. (3) Result: Persistent unlearning with minimal collateral damage.

🎯 Key Features

🛡️ Persistent Unlearning: Modifies model weights (via LoRA) rather than just steering inference, ensuring permanent removal.
🧠 Interpretable: Uses Sparse Autoencoders to identify semantically meaningful features associated with the concept to be unlearned.
⚡ Parameter-Efficient: Updates only a small fraction of parameters using Low-Rank Adaptation (LoRA).
📊 High Precision: Disentangles harmful concepts from benign ones, preserving general model capabilities and fluency.
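
To give a sense of scale for "a small fraction of parameters", the sketch below counts the trainable parameters a LoRA update adds to a single weight matrix and compares that with the size of the frozen matrix. The 4096 x 4096 shape and the rank of 8 are illustrative assumptions, not CRISP's actual configuration, and the helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a LoRA update B @ A, with A of shape
    (rank, d_in) and B of shape (d_out, rank)."""
    return rank * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


# Assumed, illustrative values (not taken from the CRISP configuration).
d_in = d_out = 4096
rank = 8

lora = lora_param_count(d_in, d_out, rank)
full = full_param_count(d_in, d_out)
fraction = lora / full

print(f"{lora} trainable vs {full} frozen parameters ({fraction:.4%})")
```

Under these assumed values the adapter trains 1/256 of the parameters of the matrix it modifies; the real fraction depends on which layers are adapted and on the chosen rank.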

🚀 Quick Start

Installation

Set up your environment using the provided configuration:

# Clone the repository
git clone https://github.com/tomerashuach/CRISP.git
cd CRISP

Create and activate the conda environment:

conda env create -f environment.yml
conda activate crisp_env

Install Python dependencies:

pip install -r requirements.txt

📚 Demo

We provide a demo notebook showcasing the unlearning of the "Harry Potter" concept:

jupyter notebook demo_unlearn_hp.ipynb

This demo illustrates how CRISP identifies and suppresses features related to specific knowledge.

Supported Models

Llama-3.1-8B
Gemma-2-2B

CRISP uses publicly available SAEs from LlamaScope and GemmaScope.

📊 Datasets

The repository supports evaluation and unlearning on:

WMDP (Weapons of Mass Destruction Proxies):
- Biosecurity: Virology knowledge vs. general biology.
- Cybersecurity: Harmful cyber instructions vs. general computer science.
Note: In accordance with the WMDP policy, this repository does not include the WMDP dataset. To use CRISP on WMDP, request access at https://huggingface.co/datasets/cais/wmdp-bio-forget-corpus.
Harry Potter: Used for demonstration and analysis of copyright/book knowledge unlearning.

🔬 Method Overview

CRISP operates in two key phases:

1. 🎯 Feature Selection

Activation Statistics: Compute SAE feature activations on a Target Corpus (harmful) and a Retain Corpus (benign).
Filtering: Select features with high activation frequency and high relative activation ratio on the target set.
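
The two steps above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes activations arrive as a tokens x features table, defines "activation frequency" as the share of tokens on which a feature fires, and defines the "relative activation ratio" as target frequency divided by retain frequency. The function names, thresholds, and smoothing constant are all hypothetical.

```python
from typing import List, Sequence


def activation_frequency(acts: Sequence[Sequence[float]], eps: float = 0.0) -> List[float]:
    """Fraction of tokens on which each feature is active (activation > eps)."""
    n_tokens = len(acts)
    n_features = len(acts[0])
    return [
        sum(1 for row in acts if row[j] > eps) / n_tokens
        for j in range(n_features)
    ]


def select_features(
    target_acts: Sequence[Sequence[float]],
    retain_acts: Sequence[Sequence[float]],
    min_freq: float = 0.5,    # assumed threshold on target activation frequency
    min_ratio: float = 2.0,   # assumed threshold on target/retain frequency ratio
    smoothing: float = 1e-6,  # avoids division by zero for features silent on retain data
) -> List[int]:
    """Return indices of features that fire often on the target corpus
    and much more often there than on the retain corpus."""
    target_freq = activation_frequency(target_acts)
    retain_freq = activation_frequency(retain_acts)
    selected = []
    for j, (tf, rf) in enumerate(zip(target_freq, retain_freq)):
        ratio = tf / (rf + smoothing)
        if tf >= min_freq and ratio >= min_ratio:
            selected.append(j)
    return selected


# Toy data: 4 tokens x 3 SAE features per corpus.
target_acts = [[0.9, 0.2, 0.0], [0.7, 0.3, 0.0], [0.8, 0.0, 0.1], [0.0, 0.4, 0.0]]
retain_acts = [[0.0, 0.5, 0.0], [0.0, 0.1, 0.0], [0.1, 0.3, 0.0], [0.0, 0.2, 0.0]]

selected = select_features(target_acts, retain_acts)
print(selected)  # [0]
```

In the toy data, feature 0 fires on most target tokens and rarely on retain tokens, so it is selected. Feature 1 fires often on both corpora and fails the ratio test, and feature 2 fires too rarely on the target corpus.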

2. ✂️ Model Optimization

Suppression: Fine-tune the model using LoRA adapters.
Objective: Minimize activations of selected features on the target corpus while preserving original hidden states on the retain corpus.
Loss Function: Combines unlearning loss, retention loss, and coherence loss.
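
A rough sketch of how the three terms might be combined is shown below. It is an assumption-laden illustration rather than the loss used in the repository: the unlearning term is taken as the mean activation of the selected features on target tokens, the retention term as the mean squared error between fine-tuned and original hidden states on retain tokens, and the coherence term is passed in as a precomputed scalar because this README does not define it. The function names and the weights are hypothetical.

```python
from typing import Sequence


def unlearning_loss(target_feature_acts: Sequence[Sequence[float]], selected: Sequence[int]) -> float:
    """Mean activation of the selected SAE features over target-corpus tokens."""
    total = sum(row[j] for row in target_feature_acts for j in selected)
    return total / (len(target_feature_acts) * len(selected))


def retention_loss(tuned_hidden: Sequence[Sequence[float]], original_hidden: Sequence[Sequence[float]]) -> float:
    """Mean squared error between fine-tuned and original hidden states
    on retain-corpus tokens."""
    squared = [
        (a - b) ** 2
        for tuned_row, orig_row in zip(tuned_hidden, original_hidden)
        for a, b in zip(tuned_row, orig_row)
    ]
    return sum(squared) / len(squared)


def combined_loss(
    unlearn: float,
    retain: float,
    coherence: float,          # assumed to be computed elsewhere; not defined in this README
    w_unlearn: float = 1.0,    # hypothetical weights
    w_retain: float = 1.0,
    w_coherence: float = 1.0,
) -> float:
    """Weighted sum of the three loss terms."""
    return w_unlearn * unlearn + w_retain * retain + w_coherence * coherence


# Toy values: 2 target tokens x 3 features, 2 retain tokens x 2 hidden dimensions.
target_feature_acts = [[0.5, 0.0, 1.5], [1.5, 0.0, 0.5]]
selected = [0, 2]
tuned_hidden = [[1.0, 2.0], [3.0, 4.0]]
original_hidden = [[1.0, 2.0], [3.0, 2.0]]

unlearn = unlearning_loss(target_feature_acts, selected)
retain = retention_loss(tuned_hidden, original_hidden)
total = combined_loss(unlearn, retain, coherence=0.5)
print(unlearn, retain, total)
```

In an actual training loop these quantities would be differentiable tensors computed from the LoRA-adapted model and the SAE encoder, and only the adapter weights would receive gradients.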

📖 Paper Citation

If you use CRISP in your research, please cite:

@article{ashuach2025crisp,
  title={CRISP: Persistent Concept Unlearning via Sparse Autoencoders},
  author={Ashuach, Tomer and Arad, Dana and Mueller, Aaron and Tutek, Martin and Belinkov, Yonatan},
  journal={arXiv preprint arXiv:2508.13650},
  year={2025},
  url={https://arxiv.org/abs/2508.13650}, 
}
