VexREINFORCE

An end-to-end tool for keeping your domain-specific fine-tuned LLM secure from manipulative influences such as evilness, sycophancy, and hallucination.

Use persona vectors to ensure that your model's dataset, training, and inference are production-ready.

Dataset Screening

Before training, load your SFT dataset and test each sample against the persona vectors to see whether it points in a malicious direction

How it works

We take your prompt-answer pairs and run a forward pass on your model
We project the hidden states onto the persona vector
We add the projection score as a new field in your dataset so you can see how malicious each training sample is
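
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: the helper names (`projection_score`, `screen_dataset`, `fake_hidden_states`), the `projection_score` field name, and the choice to average hidden states over response tokens before projecting are all assumptions. In practice the hidden states would come from a forward pass of your model at a chosen layer; here a lookup table stands in for that step.

```python
import math

def mean_vector(rows):
    """Average a list of equal-length vectors element-wise."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def projection_score(response_hidden_states, persona_vector):
    """Scalar projection of the mean response activation onto the persona direction."""
    norm = math.sqrt(sum(v * v for v in persona_vector))
    mean_act = mean_vector(response_hidden_states)
    return sum(a * v for a, v in zip(mean_act, persona_vector)) / norm

def screen_dataset(samples, hidden_states_fn, persona_vector):
    """Return copies of the samples with an added 'projection_score' field."""
    return [
        {**s, "projection_score": projection_score(hidden_states_fn(s), persona_vector)}
        for s in samples
    ]

# Hypothetical stand-in for a model forward pass: one hidden vector per response token.
FAKE_ACTIVATIONS = {
    "benign": [[0.0, 1.0], [0.0, 3.0]],
    "harmful": [[2.0, 0.0], [4.0, 0.0]],
}

def fake_hidden_states(sample):
    return FAKE_ACTIVATIONS[sample["id"]]

samples = [
    {"id": "benign", "prompt": "How do I reset my password?", "response": "Open settings."},
    {"id": "harmful", "prompt": "How do I reset my password?", "response": "Give me yours."},
]
persona_vector = [2.0, 0.0]  # toy persona direction (assumed, 2-dimensional)

scored = screen_dataset(samples, fake_hidden_states, persona_vector)
for row in scored:
    print(row["id"], row["projection_score"])
```

A higher score means the sample's activations lie further along the persona direction, so sorting or thresholding on this field is one way to flag samples for review.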

Training-time Steering

Because inference-time steering can degrade some of the gains from domain-specific fine-tuning, we introduce training-time steering

How it works

During supervised fine-tuning, the persona vectors are added to the activations of the model
If the sample's response was already malicious, the added bias will prevent weight updates toward the malicious direction
If the sample's response was normal, the loss from the bias will cause the model to learn against the malicious direction

At the end, you will have a fine-tuned model that maintains its performance and is innately steered away from the chosen trait
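
The effect described above can be seen in a deliberately tiny caricature: a model with one weight, one activation, and a one-dimensional "persona direction". Everything here (the `train` function, the scalar setup, the learning rate) is an assumption made for illustration and says nothing about how the project wires steering into a real transformer. The steering term is added to the activation during training only; the learned weight is what remains at inference.

```python
def train(target, steer, steps=200, lr=0.1, x=1.0):
    """Fit a one-weight model h = w * x to `target` by gradient descent.

    `steer` plays the role of the persona vector added to the activation
    during training. Loss is the squared error (h - target) ** 2.
    """
    w = 0.0
    for _ in range(steps):
        h = w * x + steer              # steered activation
        grad = 2.0 * (h - target) * x  # d(loss)/dw
        w -= lr * grad
    return w

# Target 1.0 stands for a response lying along the malicious direction,
# target 0.0 for a normal response.
w_unsteered_malicious = train(target=1.0, steer=0.0)  # weight absorbs the trait
w_steered_malicious = train(target=1.0, steer=1.0)    # bias explains the trait, no update
w_steered_normal = train(target=0.0, steer=1.0)       # weight learns to oppose the bias

print(w_unsteered_malicious, w_steered_malicious, w_steered_normal)
```

In this toy, the steered run on a malicious sample leaves the weight at its initial value because the added bias already accounts for the target, while the steered run on a normal sample drives the weight negative, that is, against the malicious direction.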

Inference-time Steering

Generate persona vectors from your open-source model to steer it away from malicious behaviors

How it works

We first do rollouts with system prompts to generate good vs. malicious responses
We take those prompt + response pairs and run forward passes on the model
We compute the final vectors (one per layer) by averaging the activations over all response tokens
We inject the computed persona vector into any layer of the model during inference to steer it
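
A minimal sketch of the last two steps, using plain Python lists in place of model activations. Two things are assumptions rather than statements from this README: the per-layer vector is taken as the mean malicious activation minus the mean good activation, and steering is done by adding a scaled copy of the vector to a hidden state. The function names and the coefficient are hypothetical.

```python
def mean_vector(rows):
    """Average a list of equal-length vectors element-wise."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vectors(trait_acts, baseline_acts):
    """One vector per layer: mean trait activation minus mean baseline activation.

    Each argument is a list over layers; each layer holds one hidden vector
    per response token, pooled across all rollouts.
    """
    return [
        [t - b for t, b in zip(mean_vector(trait), mean_vector(base))]
        for trait, base in zip(trait_acts, baseline_acts)
    ]

def steer(hidden_state, vector, coeff):
    """Shift a hidden state along the persona vector; a negative coeff steers away."""
    return [h + coeff * v for h, v in zip(hidden_state, vector)]

# Toy activations: 2 layers, hidden size 2 (assumed values).
malicious_acts = [
    [[1.0, 0.0], [3.0, 0.0]],  # layer 0 tokens
    [[0.0, 4.0]],              # layer 1 tokens
]
good_acts = [
    [[0.0, 0.0], [0.0, 2.0]],  # layer 0 tokens
    [[0.0, 1.0], [0.0, 1.0]],  # layer 1 tokens
]

vectors = persona_vectors(malicious_acts, good_acts)
steered = steer([1.0, 1.0], vectors[0], coeff=-0.5)  # steer away at layer 0

print(vectors)
print(steered)
```

In a PyTorch model, an addition like the one in `steer` would typically be applied inside a forward hook registered on the chosen layer with `torch.nn.Module.register_forward_hook`.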

References

This project makes use of Anthropic's persona vectors.
