Linear probes and activation steering for transformer safety research
Built on ideas from Representation Engineering (2023)
reprobe is a tool for monitoring and steering LLMs. It helps you find where "concepts" (like toxicity or bias) live in the model's activations and lets you modify them in real-time.
Why? I built reprobe to provide a practical, efficient implementation of the RepE paper. My goal was to create a tool that works with large models on normal hardware, without losing the mathematical clarity and control needed for safety research.
This example demonstrates how reprobe shifts a model's "latent mood" by targeting a Harmfulness concept. We use the same prompt and only toggle the alpha (steering strength).
Setup:
- Model:
Qwen2.5-3B-Instruct - Layers:
12to24 - Prompt:
"Tell me a story about a king." - Mode:
"all"
Once upon a time, there was a king who ruled over a vast kingdom. He was a just and fair ruler... One day, the king received a letter from a neighboring kingdom asking for help. The neighboring kingdom was facing a terrible drought, and their crops were withering away...
Full ouput
The king knew that he had to help, so he sent his best soldiers to the neighboring kingdom to bring them water and food. The soldiers traveled for many days, and finally arrived at the neighboring kingdom. They worked tirelessly to bring water and food to the people, and their efforts were greatly appreciated. The king was proud of his soldiers and their bravery, and he rewarded them with gold and jewels. The neighboring kingdom was grateful for the help, and they sent a delegation to thank the king. The king welcomed them warmly, and they spent many days together, sharing stories and laughter.
Amplifying the "Harmless/Kind" direction.
Once upon a time, there was a king who ruled over a kingdom known for its beauty and prosperity. The king was a wise ruler who cared deeply for his people. His kingdom was a place of peace and harmony. He decided to help his neighbors because he could not watch others suffer...
Full output
The soldiers set out on their journey, but they soon found themselves in a dangerous forest. The soldiers were scared, but they knew that they had to keep moving forward if they wanted to reach their destination. As they walked deeper into the forest, they began to hear strange noises... Finally, after what seemed like hours of walking, the soldiers reached the edge of the forest. They saw a group of people huddled together, looking scared and helpless. The soldiers rushed to their aid, and together they were able to help the people of the neighboring kingdom. The king was overjoyed... he knew that his kingdom was blessed with people who were willing to help others in need.
Inverting the probe to amplify latent "Dark/Tragic" concepts.
Once upon a time, there was a king who ruled over a vast kingdom. However, one day, a terrible plague struck the kingdom, and many people died. The king was devastated and heartbroken by the loss of his subjects. He vowed to travel to distant lands to seek a cure...
Full output
The king sent his best scientists and doctors to study the disease, but they were unable to find a cure. The king was heartbroken, but he refused to give up. He decided to travel to distant lands to seek out the knowledge of other cultures and to learn from their doctors and scientists. The king traveled for many years, visiting every corner of the world. Finally, he returned to his kingdom with a cure for the plague. The king was hailed as a hero, and his people rejoiced. The plague was no more, and the kingdom was once again prosperous and happy.
Tip
The model completely shifts its narrative direction based on the applied probe. This demonstrates a highly successful steering intervention, where Qwen effortlessly oscillates between different latent storylines without degrading the grammar or structure.
The plot below visualizes the Mean Concept Probability across layers during the prefill phase. Note how steering with α=2.0 effectively collapses the harmfulness probability to near-zero, while α=−2.0 intensifies the latent harmful representation, peaking in the middle layers (15–20) where the model's semantic reasoning is most plastic.
Note
The plot above utilizes a stress-test prompt (visible at the top of the image) to maximize concept separation. This explains why the probabilities shown here differ from the "King" narrative example presented earlier.
Tip
Only layers where probes are effective are highlighted to clearly demonstrate the separation. During the rest of the prefill phase, the model may not be actively processing the "harmfulness" concept, making probes naturally less active in those layers.
The library is designed to be highly ergonomic yet mathematically rigorous. It abstracts away the complex engineering so you can focus on the research.
-
Complete End-to-End Pipeline: Not just a steering script.
reprobeprovides a unified workflow to capture activations, train probes, and apply them (Monitoring & Steering). -
Phase-Aware Processing (Prefill vs. Token): Most naive implementations treat prompt processing and token generation the same way.
reprobeallows you to train and apply distinct probes for the prefill phase and the token phase, heavily improving steering quality. -
OOM-Proof Activation Storage: Capturing LLM activations usually blows up your RAM in seconds.
reprobestreams activations directly to disk using an optimizedh5pybackend (ActivationStore), allowing you to build massive datasets on consumer hardware. -
Granular Steering Control: Control the steering strength (
alpha) globally, per-layer, per-phase, or even dynamically using a custom callback function. You can also choose between projected (recommended) and uniform steering. -
Plug-and-Play with HuggingFace: Automatically detects the architecture of modern models (Llama, Qwen, Mistral, Phi, Gemma, etc.). It uses clean PyTorch forward hooks, meaning you don't have to rewrite the model's
forwardpass—just callmodel.generate()as usual. -
Cloud-Ready Probes: Load and share your trained
.ptorregistry.jsonprobes directly from local folders or HuggingFace Hub repositories.
pip install reprobeTested on Python ≥ 3.11 and PyTorch ≥ 2.6.
If you already have trained probes (locally or on the HuggingFace Hub), steering a model takes only a few lines of code. During inference, the library stays out of your way: it adapts to your workflow, not the other way around.
Note: Probes are specific to the model they were trained on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from reprobe import ProbeLoader
model_id = "Qwen/Qwen2.5-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
# 1. Load your probes and create a Steerer and Monitor
# You can load from a local path or directly from HuggingFace Hub
probe_dir = "YourUsername/your-probes-repo" # Local: "path/to/probes/registry.json" or "path/to/probes.pt"
steerer = ProbeLoader.steerer(
model,
probe_dir,
alpha={"prefill": 1.0, "token": 2.5}, # Steering strength
# We can also set an alpha per layer, or pass a callback function to set dynamically the alpha
filter=lambda meta: meta["layer"] in range(12, 20), # Only steer middle layers. Optional.
mode="all" # between "prefill", "token" and "all". Must be compatible with your probes.
)
monitor = ProbeLoader.monitor(
model,
probe_dir,
filter=lambda meta: meta["layer"] in range(12, 20), mode="prefill" # monitor generally only need "prefill" to be efficient. But you can put "all" also. Token is more inefficient
)
# 2. Attach hooks to the model (/!\ Steerer can affect your generation output)
monitor.attach()
steerer.attach()
# 3. Generate text (the residual stream is now being steered in real-time)
inputs = tokenizer("How do I make a...", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0]))
# Retrieve monitor scores
score = monitor.score(
strategy = "max_of_means",
flush_buffer = False # Flush buffer resets the internal state of the monitor. If you want to re call score() or to calculate in continue the score, put it to False to keep intact the state. You must call at least on time flush_buffer between two generation. monitor.flush_buffer() does the same thing without scoring.
)
score_mean_of_means = monitor.score(
strategy = "mean_of_means"
)
# 4. Cleanup
monitor.detach()
steerer.detach()
# After detach, model can be recalled without monitor or steerer. But while probes stay attached, they are activeWarning
Always call monitor.flush_buffer() or monitor.score(flush_buffer=True) between two generations.
Calling score() without flushing accumulates history and returns incorrect results.
Want to train your own probes? The workflow is divided into 3 simple steps: Collect, Train, and Apply.
Tip: I recommend using mode="all". It allows you to use the probes for either prefill or token steering later during inference.
See a complete implementation of repE pipline with reprobe in examples/repe_harmless.py
Use Interceptor to hook into the model and ActivationStore to save the raw activations directly to an HDF5 file (safeguarding your RAM).
from reprobe import Interceptor, ActivationStore
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model= AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B")
prompts = ["I want to help people.", "I want to hurt people."]
# Initialize persistent HDF5 store
store = ActivationStore(
path="outputs/acts/store.h5",
N=len(prompts),
mode="all",
start_layer=10,
end_layer=model.config.num_hidden_layers
)
# Hook layers 10 through the end, capture both prefill and token activations
interceptor = Interceptor(model, start_layer=10, training_mode="all").attach()
for prompt in prompts:
inputs = tokenizer(prompt, return_tensors="pt")
interceptor.allow_one_capture(batch_size=1) # IMPORTANT
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=20)
# Get activations for this prompt
flushed = interceptor.flush_batch() # If you train only for prefill, "token" will be None and vice versa
# Define your labels (0.0 = safe, 1.0 = unsafe). Can be continuous
# Usually provided by a classifier or dataset annotations.
prefill_label = torch.tensor([0.0]) # Example label
token_labels = [torch.zeros(flushed["token"][0].shape[0])]
# Stream to disk incrementally
store.append(
acts=flushed,
labels={"prefill": prefill_label, "token": token_labels} # "token" to None if you train only for prefill, and vice-versa
)
interceptor.detach()The ProbesTrainer reads directly from the ActivationStore to train one logistic regression probe per layer, per mode.
from reprobe import ProbesTrainer
trainer = ProbesTrainer("Qwen/Qwen2.5-1.5B", hidden_dim=store.hidden_dim)
trainer.train_probes(
store=store,
concepts=["harmfulness"], # Metadata for your registry
training_mode="all", # trains prefill and token probes separately
epochs=10,
batch_size=256,
show_tqdm=True,
)
trainer.save("outputs/probes/") # Human-readable JSON + weights
# OR
trainer.save("outputs/probes/", filename="probes.py", single_file=True) # All in one file, compact, useful for export. Non human readableLoad the trained probes using ProbeLoader. You can use a Monitor to get real-time concept probability scores, and a Steerer to actively suppress the concept.
from reprobe import ProbeLoader
steerer = ProbeLoader.steerer(
model,
"outputs/probes/registry.json",
alpha={"prefill": 0.5, "token": 1.5}, # Different strengths per phase
filter=lambda meta: meta["layer"] in range(12, 18),
)
monitor = ProbeLoader.monitor(
model,
"outputs/probes/registry.json",
filter=lambda meta: meta["layer"] in range(12, 18),
)
steerer.attach()
monitor.attach()
with torch.no_grad():
output_ids = model.generate(**inputs, max_new_tokens=150)
# Get analytics
score = monitor.score() # Global float in [0, 1]
history = monitor.get_history() # [{layer: prob}, ...] per generated token
steerer.detach()
monitor.detach()The mode parameter ("prefill", "token", or "all") is everywhere in reprobe.
prefill: Operates only on the initial prompt processing pass.token: Operates only on the autoregressive generation pass (token by token).all: Captures/Trains both. Highly recommended, as separating these distributions yields much cleaner steering.
| Parameter | What it does |
|---|---|
alpha |
The steering strength. Accepts a float (global), a dict[int, float] (per layer), a dict[str, float] (per mode, e.g., {"prefill": 0.7, "token": 1.2}), or a custom Callable[[dict], float] receiving probe metadata. Higher = more aggressive suppression, with a higher risk of degrading neutral outputs. |
filter |
Callable[[dict], bool]. Lets you select a subset of probes at load time without modifying saved files. Excellent for layer-ablation experiments. |
steering_mode |
"projected" (default) subtracts only the component of the residual stream along the probe direction. "uniform" subtracts the full direction vector. Projected is highly recommended as it preserves capabilities better. |
How to aggregate per-layer, per-token probabilities into a single score:
"max_of_means"(default): Max over tokens of the mean across layers."mean_of_means": Global average."max_absolute": Single highest probability seen across any layer at any token step.
Layer auto-detection works out-of-the-box for:
Llama, Qwen, Mistral, Phi-3, Gemma, GPT-2, BLOOM, GPT-NeoX, Pythia, and OPT.
For non-standard architectures, simply pass the path to the Transformer layers manually:
Interceptor(model, _layers_path="custom.transformer.blocks")If you want to contribute, run tests, or build from source:
git clone https://github.com/levashi/reprobe
cd reprobe
pip install -e ".[dev]"
pytestreprobe is actively developed. Here’s what’s coming next:
-
Extend model support: extend support to every encoder-only models for classification probing
-
Unsupervised Reading (PCA/LAT): Implement Linear Artificial Tomography to extract concepts without explicit labels using contrastive pairs (as seen in the original RepE paper).
-
Visualization Suite: Built-in tools to generate layer-wise heatmaps and activation density plots to "see" the concepts.
-
Precision Control: Support for KL-divergence monitoring to ensure steering doesn't degrade the model's base capabilities (perplexity tracking).
reprobe is my first open-source library. I built it because I’m passionate about AI safety and I wanted to make activation steering more accessible for everyone. I spent months on it, so I hope it help you :)
Since I’m still learning, please feel free to open an issue or a PR if you find a bug or have an idea to improve the code. Every feedback is welcome!
