safehorizon

Personal Digital Twin CLI — turn your own platform data exports into a training corpus, fine-tune a LoRA adapter on Modal's cloud GPUs so it mirrors your communication style, and save the result as a portable .safetensors file.

What it does

  1. Ingests data exports from Google, Instagram, Netflix, WhatsApp, and Spotify (standard ZIP downloads from each platform's privacy page).
  2. Normalises conversations into a deduplicated JSONL training corpus with PII redacted (see the example record after this list).
  3. Trains a QLoRA adapter on Modal using a small (<3B parameter) causal-LM base model.
  4. Saves the adapter as a .safetensors file you can load with PEFT anywhere.
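
For illustration, a single corpus record might look like the line below. The exact field names are an assumption based on the chat-formatted schema used for training, not a documented format:

{"messages": [{"role": "user", "content": "Are we still on for dinner?"}, {"role": "assistant", "content": "Sure, 7pm at the usual place."}], "source": "whatsapp"}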

The trained adapter is typically 5–50 MB — small enough to carry on a USB drive or store in a private git repo.


Installation

pip install safehorizon
# or, from source:
git clone https://github.com/ufukkaraca/safehorizon
cd safehorizon
pip install -e .

Authentication

Modal (required for training)

pip install modal
modal token new          # opens browser, stores token in ~/.modal.toml

HuggingFace (required for gated models like Llama)

export HF_TOKEN=hf_...
# or add it to .env

How to export your data

Platform    Where to go
Google      myaccount.google.com/data-and-privacy → Download your data
Instagram   Settings → Your activity → Download your information
Netflix     netflix.com/account/getmyinfo
WhatsApp    Open a chat → ⋮ → More → Export chat
Spotify     spotify.com/account/privacy → Request data

Google Takeout exports can be split across multiple ZIPs — pass each part separately and safehorizon will merge them automatically.


Quick start

All-in-one (run)

safehorizon run \
  --name "Alice" \
  --whatsapp  ~/exports/WhatsApp\ Chat\ with\ Bob.zip \
  --google    ~/exports/takeout-20240101-001.zip \
  --google    ~/exports/takeout-20240101-002.zip \
  --instagram ~/exports/instagram-alice.zip \
  --netflix   ~/exports/netflix.zip \
  --spotify   ~/exports/my_spotify_data.zip \
  --output    ~/my-twin

This creates:

~/my-twin/
  corpus.jsonl          ← normalised training data
  adapter/
    adapter_config.json
    adapter_model.safetensors   ← your digital twin weights
    tokenizer.json
    …

Step by step

# 1. Parse exports → JSONL
safehorizon ingest \
  --name "Alice" \
  --whatsapp chat.zip \
  --output corpus.jsonl

# 2. Train on Modal → adapter
safehorizon train \
  --corpus corpus.jsonl \
  --output ./adapter

CLI reference

safehorizon ingest

Parse platform exports into a training corpus.

Flag               Default        Description
--name             (required)     Your name as it appears in your messages
--google PATH                     Google Takeout ZIP (repeatable)
--instagram PATH                  Instagram export ZIP (repeatable)
--netflix PATH                    Netflix export ZIP (repeatable)
--whatsapp PATH                   WhatsApp export ZIP (repeatable)
--spotify PATH                    Spotify export ZIP (repeatable)
--output PATH      corpus.jsonl   Output JSONL file
--min-chars INT    40             Drop samples shorter than this
--max-chars INT    8000           Drop samples longer than this
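
For example, to build a corpus from two exports while keeping only medium-length samples (paths and thresholds are illustrative):

safehorizon ingest \
  --name "Alice" \
  --whatsapp ~/exports/chat.zip \
  --google   ~/exports/takeout-001.zip \
  --min-chars 80 \
  --max-chars 4000 \
  --output corpus.jsonl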

safehorizon train

Fine-tune a LoRA adapter on Modal.

Flag                Default                         Description
--corpus PATH       (required)                      JSONL corpus from ingest
--output DIR        ./adapter                       Local adapter output directory
--base-model ID     unsloth/Llama-3.2-1B-Instruct   HuggingFace model ID
--rank INT          16                              LoRA rank r
--alpha INT         2×rank                          LoRA scaling α
--epochs INT        3                               Training epochs
--batch-size INT    4                               Per-device batch size
--grad-accum INT    4                               Gradient accumulation steps
--lr FLOAT          2e-4                            Peak learning rate
--max-seq-len INT   2048                            Max token length per sample
--gpu TYPE          A10G                            Modal GPU type (A10G, A100, …)
--hf-token TOKEN    $HF_TOKEN                       HuggingFace token for gated models
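
For example, to train a higher-rank adapter for a single epoch on an A100 (values are illustrative; all flags are listed above):

safehorizon train \
  --corpus corpus.jsonl \
  --rank 32 \
  --alpha 64 \
  --epochs 1 \
  --gpu A100 \
  --output ./adapter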

safehorizon run

Shorthand for ingest + train in one command (accepts all flags from both).


Loading the adapter

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "unsloth/Llama-3.2-1B-Instruct"
adapter_path = "./my-twin/adapter"

tokenizer = AutoTokenizer.from_pretrained(adapter_path)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

# Chat with your twin
messages = [{"role": "user", "content": "How was your weekend?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
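
Optionally, you can merge the adapter into the base weights so the result can be used without PEFT at inference time. Continuing from the snippet above, a minimal sketch using PEFT's merge_and_unload (it assumes the full-precision model fits in memory, and the output directory is illustrative):

# Fold the LoRA weights into the base model and save a standalone copy
merged = model.merge_and_unload()
merged.save_pretrained("./my-twin/merged-model")
tokenizer.save_pretrained("./my-twin/merged-model")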

Architecture

safehorizon/
  ingest/
    archive.py       Split-ZIP assembly
    google.py        Google Takeout parser
    instagram.py     Instagram DM/comments parser
    netflix.py       Netflix viewing history parser
    whatsapp.py      WhatsApp chat export parser
    spotify.py       Spotify streaming history parser
  normalize/
    schema.py        PII redaction + sample filtering
    writer.py        Deduplicating JSONL writer
  train/
    modal_app.py     Modal GPU image + QLoRA training function
    orchestrator.py  Local → Modal call orchestration
  ui/
    progress.py      Rich console + progress bars
  cli.py             Click CLI (ingest / train / run)
  models.py          Shared data types (ChatSample, Turn, …)

Training approach (a configuration sketch follows the list):

  • Base model frozen and quantised to 4-bit NF4 (QLoRA / BitsAndBytes).
  • LoRA adapters injected into all attention + MLP projection matrices.
  • Trained with SFTTrainer (TRL) on the chat-formatted JSONL corpus.
  • Adapter saved in .safetensors format (no arbitrary code execution risk).
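
As a rough illustration of what the Modal training function sets up (not the project's exact code; the quantisation and LoRA settings below mirror the defaults in the CLI reference):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Freeze and quantise the base model to 4-bit NF4 (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Inject LoRA adapters into the attention + MLP projection matrices
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,   # 2×rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# The wrapped model is then trained with TRL's SFTTrainer on the
# chat-formatted corpus and saved via model.save_pretrained(...)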

Privacy notes

  • All processing of your raw exports happens locally before anything is sent to Modal.
  • PII (email addresses, phone numbers, passwords, API keys) is automatically redacted from training samples (a simplified redaction sketch follows this list).
  • The trained adapter is sent back to your machine; no copy is retained by Modal after the job completes.
  • To permanently delete your twin, delete the .safetensors file — it contains all personalisation.
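
The redaction step lives in normalize/schema.py. The snippet below is not the project's actual logic, just a minimal sketch of how regex-based redaction of emails and phone numbers can work:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    # Replace matches with placeholder tokens rather than dropping the sample
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

print(redact("Call me on +44 7700 900123 or mail alice@example.com"))
# -> Call me on [PHONE] or mail [EMAIL]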

License

MIT
