An end-to-end tool to ensure your domain-specific fine-tuned LLM is protected from undesirable traits (evil, sycophancy, hallucination)
- Utilize persona vectors to ensure the dataset, training, and inference of your model are production ready.
Before training, load your SFT dataset and test it against persona vectors for pull toward a malicious direction
- We take your prompt-answer pairs and run a forward pass through your model
- We project the hidden states onto the persona vector
- We add the projection score as a new field in your dataset, so you can see how strongly each training sample expresses the malicious trait
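The scoring step above amounts to projecting a sample's mean response-token activation onto the persona direction. A minimal sketch, assuming per-layer hidden states have already been extracted (the function name, field name, and random data below are illustrative, not part of this project's API):

```python
import numpy as np

def persona_score(hidden_states, persona_vector):
    """Project the mean response-token activation onto the persona direction.

    hidden_states: (num_tokens, hidden_dim) activations from one layer
    persona_vector: (hidden_dim,) persona direction (normalized here)
    Returns a scalar; higher means the sample points more along the trait.
    """
    v = persona_vector / np.linalg.norm(persona_vector)
    mean_act = hidden_states.mean(axis=0)
    return float(mean_act @ v)

# Hypothetical usage: score each sample and attach it as a new dataset field.
persona_vec = np.random.randn(64)
dataset = [{"prompt": "...", "response": "...",
            "hidden_states": np.random.randn(12, 64)}]
for sample in dataset:
    sample["persona_score"] = persona_score(sample["hidden_states"], persona_vec)
```

In practice the hidden states would come from a forward pass over the prompt-answer pair, restricted to the response tokens.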
Because inference-time steering can degrade some of the gains from domain-specific fine-tuning, we introduce training-time steering
- During supervised fine-tuning, the persona vectors are added to the model's activations
- If a sample's response is already malicious, the added bias absorbs the trait signal and prevents weight updates toward the malicious direction
- If a sample's response is benign, the loss induced by the bias causes the model to learn against the malicious direction
At the end, you have a fine-tuned model that maintains task performance and is innately steered away from the target trait
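The training-time mechanism above can be sketched with a PyTorch forward hook that adds the persona direction to a layer's output during SFT. This is a minimal illustration under assumed mechanics, not this project's implementation; `alpha`, the stand-in `Linear` layer, and the helper name are all hypothetical:

```python
import torch

def make_steering_hook(persona_vector, alpha=1.0):
    """Return a forward hook that adds the persona direction to a layer's
    activations. During SFT, biasing activations along the trait direction
    changes the loss so gradient updates push away from it (assumed mechanism)."""
    def hook(module, inputs, output):
        # Returning a tensor from a forward hook replaces the layer's output.
        return output + alpha * persona_vector
    return hook

# Hypothetical usage on one layer (a Linear stands in for a transformer block):
layer = torch.nn.Linear(8, 8)
persona_vec = torch.randn(8)
handle = layer.register_forward_hook(make_steering_hook(persona_vec, alpha=0.5))
# ... run the usual SFT forward/loss/backward with the hook active ...
# handle.remove()  # remove before inference so the trained model runs unsteered
```

The hook is removed after training, so the final model carries no runtime overhead, unlike inference-time steering.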
Generate persona vectors from your open-source model to steer it away from malicious behaviors
- We first do rollouts with contrasting system prompts to elicit benign vs. malicious responses
- We take those prompt + response pairs and run forward passes through the model
- We compute the final vectors (one per layer) by averaging the activations across all response tokens
- We can then inject the computed persona vector into any layer of the model at inference time to steer it
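The generation steps above can be sketched as follows. This assumes, per Anthropic's persona-vector approach, that the per-layer vector is the difference between the mean response-token activations of the malicious and benign rollouts; the function names and a negative steering coefficient for steering *away* from the trait are illustrative assumptions:

```python
import torch

def compute_persona_vector(malicious_acts, benign_acts):
    """Difference-of-means persona vector for one layer.

    Each argument is a list of (num_response_tokens, hidden_dim) activation
    tensors, one per rollout. Tokens from all rollouts are pooled before
    averaging.
    """
    mal_mean = torch.cat(malicious_acts).mean(dim=0)
    ben_mean = torch.cat(benign_acts).mean(dim=0)
    return mal_mean - ben_mean

def steer_away_hook(persona_vector, alpha=-1.0):
    """Inference-time forward hook; a negative alpha steers generations
    away from the trait the persona vector encodes."""
    def hook(module, inputs, output):
        return output + alpha * persona_vector
    return hook
```

At inference, the hook would be registered on the chosen transformer layer exactly as during training-time steering, but kept active while generating.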
This project makes use of Anthropic's persona vectors.